Old cover letter (tags/pull-tcg-20230511):

The following changes since commit d530697ca20e19f7a626f4c1c8b26fccd0dc4470:

  Merge tag 'pull-testing-updates-100523-1' of https://gitlab.com/stsquad/qemu into staging (2023-05-10 16:43:01 +0100)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20230511

for you to fetch changes up to b2d4d6616c22325dff802e0a35092167f2dc2268:

  target/loongarch: Do not include tcg-ldst.h (2023-05-11 06:06:04 +0100)

----------------------------------------------------------------
target/m68k: Fix gen_load_fp regression
accel/tcg: Ensure fairness with icount
disas: Move disas.c into the target-independent source sets
tcg: Use common routines for calling slow path helpers
tcg/*: Cleanups to qemu_ld/st constraints
tcg: Remove TARGET_ALIGNED_ONLY
accel/tcg: Reorg system mode load/store helpers

----------------------------------------------------------------
Jamie Iles (2):
      cpu: expose qemu_cpu_list_lock for lock-guard use
      accel/tcg/tcg-accel-ops-rr: ensure fairness with icount

Richard Henderson (49):
      target/m68k: Fix gen_load_fp for OS_LONG
      accel/tcg: Fix atomic_mmu_lookup for reads
      disas: Fix tabs and braces in disas.c
      disas: Move disas.c to disas/
      disas: Remove target_ulong from the interface
      disas: Remove target-specific headers
      tcg/i386: Introduce prepare_host_addr
      tcg/i386: Use indexed addressing for softmmu fast path
      tcg/aarch64: Introduce prepare_host_addr
      tcg/arm: Introduce prepare_host_addr
      tcg/loongarch64: Introduce prepare_host_addr
      tcg/mips: Introduce prepare_host_addr
      tcg/ppc: Introduce prepare_host_addr
      tcg/riscv: Introduce prepare_host_addr
      tcg/s390x: Introduce prepare_host_addr
      tcg: Add routines for calling slow-path helpers
      tcg/i386: Convert tcg_out_qemu_ld_slow_path
      tcg/i386: Convert tcg_out_qemu_st_slow_path
      tcg/aarch64: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/arm: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/loongarch64: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/mips: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/ppc: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/riscv: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/s390x: Convert tcg_out_qemu_{ld,st}_slow_path
      tcg/loongarch64: Simplify constraints on qemu_ld/st
      tcg/mips: Remove MO_BSWAP handling
      tcg/mips: Reorg tlb load within prepare_host_addr
      tcg/mips: Simplify constraints on qemu_ld/st
      tcg/ppc: Reorg tcg_out_tlb_read
      tcg/ppc: Adjust constraints on qemu_ld/st
      tcg/ppc: Remove unused constraints A, B, C, D
      tcg/ppc: Remove unused constraint J
      tcg/riscv: Simplify constraints on qemu_ld/st
      tcg/s390x: Use ALGFR in constructing softmmu host address
      tcg/s390x: Simplify constraints on qemu_ld/st
      target/mips: Add MO_ALIGN to gen_llwp, gen_scwp
      target/mips: Add missing default_tcg_memop_mask
      target/mips: Use MO_ALIGN instead of 0
      target/mips: Remove TARGET_ALIGNED_ONLY
      target/nios2: Remove TARGET_ALIGNED_ONLY
      target/sh4: Use MO_ALIGN where required
      target/sh4: Remove TARGET_ALIGNED_ONLY
      tcg: Remove TARGET_ALIGNED_ONLY
      accel/tcg: Add cpu_in_serial_context
      accel/tcg: Introduce tlb_read_idx
      accel/tcg: Reorg system mode load helpers
      accel/tcg: Reorg system mode store helpers
      target/loongarch: Do not include tcg-ldst.h

Thomas Huth (2):
      disas: Move softmmu specific code to separate file
      disas: Move disas.c into the target-independent source set

 configs/targets/mips-linux-user.mak | 1 -
 configs/targets/mips-softmmu.mak | 1 -
 configs/targets/mips64-linux-user.mak | 1 -
 configs/targets/mips64-softmmu.mak | 1 -
 configs/targets/mips64el-linux-user.mak | 1 -
 configs/targets/mips64el-softmmu.mak | 1 -
 configs/targets/mipsel-linux-user.mak | 1 -
 configs/targets/mipsel-softmmu.mak | 1 -
 configs/targets/mipsn32-linux-user.mak | 1 -
 configs/targets/mipsn32el-linux-user.mak | 1 -
 configs/targets/nios2-softmmu.mak | 1 -
 configs/targets/sh4-linux-user.mak | 1 -
 configs/targets/sh4-softmmu.mak | 1 -
 configs/targets/sh4eb-linux-user.mak | 1 -
 configs/targets/sh4eb-softmmu.mak | 1 -
 meson.build | 3 -
 accel/tcg/internal.h | 9 +
 accel/tcg/tcg-accel-ops-icount.h | 3 +-
 disas/disas-internal.h | 21 +
 include/disas/disas.h | 23 +-
 include/exec/cpu-common.h | 1 +
 include/exec/cpu-defs.h | 7 +-
 include/exec/cpu_ldst.h | 26 +-
 include/exec/memop.h | 13 +-
 include/exec/poison.h | 1 -
 tcg/loongarch64/tcg-target-con-set.h | 2 -
 tcg/loongarch64/tcg-target-con-str.h | 1 -
 tcg/mips/tcg-target-con-set.h | 13 +-
 tcg/mips/tcg-target-con-str.h | 2 -
 tcg/mips/tcg-target.h | 4 +-
 tcg/ppc/tcg-target-con-set.h | 11 +-
 tcg/ppc/tcg-target-con-str.h | 7 -
 tcg/riscv/tcg-target-con-set.h | 2 -
 tcg/riscv/tcg-target-con-str.h | 1 -
 tcg/s390x/tcg-target-con-set.h | 2 -
 tcg/s390x/tcg-target-con-str.h | 1 -
 accel/tcg/cpu-exec-common.c | 3 +
 accel/tcg/cputlb.c | 1113 ++++++++++++++++-------------
 accel/tcg/tb-maint.c | 2 +-
 accel/tcg/tcg-accel-ops-icount.c | 21 +-
 accel/tcg/tcg-accel-ops-rr.c | 37 +-
 bsd-user/elfload.c | 5 +-
 cpus-common.c | 2 +-
 disas/disas-mon.c | 65 ++
 disas.c => disas/disas.c | 109 +--
 linux-user/elfload.c | 18 +-
 migration/dirtyrate.c | 26 +-
 replay/replay.c | 3 +-
 target/loongarch/csr_helper.c | 1 -
 target/loongarch/iocsr_helper.c | 1 -
 target/m68k/translate.c | 1 +
 target/mips/tcg/mxu_translate.c | 3 +-
 target/nios2/translate.c | 10 +
 target/sh4/translate.c | 102 ++-
 tcg/tcg.c | 480 ++++++++++++-
 trace/control-target.c | 9 +-
 target/mips/tcg/micromips_translate.c.inc | 24 +-
 target/mips/tcg/mips16e_translate.c.inc | 18 +-
 target/mips/tcg/nanomips_translate.c.inc | 32 +-
 tcg/aarch64/tcg-target.c.inc | 347 ++++-----
 tcg/arm/tcg-target.c.inc | 455 +++++-------
 tcg/i386/tcg-target.c.inc | 453 +++++-------
 tcg/loongarch64/tcg-target.c.inc | 313 +++-----
 tcg/mips/tcg-target.c.inc | 870 +++++++---------------
 tcg/ppc/tcg-target.c.inc | 512 ++++++-------
 tcg/riscv/tcg-target.c.inc | 304 ++++----
 tcg/s390x/tcg-target.c.inc | 314 ++++----
 disas/meson.build | 6 +-
 68 files changed, 2788 insertions(+), 3039 deletions(-)
 create mode 100644 disas/disas-internal.h
 create mode 100644 disas/disas-mon.c
 rename disas.c => disas/disas.c (79%)

New cover letter (tags/pull-tcg-20230530):

The following changes since commit 7fe6cb68117ac856e03c93d18aca09de015392b0:

  Merge tag 'pull-target-arm-20230530-1' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2023-05-30 08:02:05 -0700)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20230530

for you to fetch changes up to 276d77de503e8f5f5cbd3f7d94302ca12d1d982e:

  tests/decode: Add tests for various named-field cases (2023-05-30 10:55:39 -0700)

----------------------------------------------------------------
Improvements to 128-bit atomics:
  - Separate __int128_t type and arithmetic detection
  - Support 128-bit load/store in backend for i386, aarch64, ppc64, s390x
  - Accelerate atomics via host/include/
Decodetree:
  - Add named field syntax
  - Move tests to meson

----------------------------------------------------------------
Peter Maydell (5):
      docs: Document decodetree named field syntax
      scripts/decodetree: Pass lvalue-formatter function to str_extract()
      scripts/decodetree: Implement a topological sort
      scripts/decodetree: Implement named field support
      tests/decode: Add tests for various named-field cases

Richard Henderson (22):
      tcg: Fix register move type in tcg_out_ld_helper_ret
      accel/tcg: Fix check for page writeability in load_atomic16_or_exit
      meson: Split test for __int128_t type from __int128_t arithmetic
      qemu/atomic128: Add x86_64 atomic128-ldst.h
      tcg/i386: Support 128-bit load/store
      tcg/aarch64: Rename temporaries
      tcg/aarch64: Reserve TCG_REG_TMP1, TCG_REG_TMP2
      tcg/aarch64: Simplify constraints on qemu_ld/st
      tcg/aarch64: Support 128-bit load/store
      tcg/ppc: Support 128-bit load/store
      tcg/s390x: Support 128-bit load/store
      accel/tcg: Extract load_atom_extract_al16_or_al8 to host header
      accel/tcg: Extract store_atom_insert_al16 to host header
      accel/tcg: Add x86_64 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 lse2 load_atom_extract_al16_or_al8
      accel/tcg: Add aarch64 store_atom_insert_al16
      tcg: Remove TCG_TARGET_TLB_DISPLACEMENT_BITS
      decodetree: Add --test-for-error
      decodetree: Fix recursion in prop_format and build_tree
      decodetree: Diagnose empty pattern group
      decodetree: Do not remove output_file from /dev
      tests/decode: Convert tests to meson

 docs/devel/decodetree.rst | 33 ++-
 meson.build | 15 +-
 host/include/aarch64/host/load-extract-al16-al8.h | 40 ++++
 host/include/aarch64/host/store-insert-al16.h | 47 ++++
 host/include/generic/host/load-extract-al16-al8.h | 45 ++++
 host/include/generic/host/store-insert-al16.h | 50 ++++
 host/include/x86_64/host/atomic128-ldst.h | 68 ++++++
 host/include/x86_64/host/load-extract-al16-al8.h | 50 ++++
 include/qemu/int128.h | 4 +-
 tcg/aarch64/tcg-target-con-set.h | 4 +-
 tcg/aarch64/tcg-target-con-str.h | 1 -
 tcg/aarch64/tcg-target.h | 12 +-
 tcg/arm/tcg-target.h | 1 -
 tcg/i386/tcg-target.h | 5 +-
 tcg/mips/tcg-target.h | 1 -
 tcg/ppc/tcg-target-con-set.h | 2 +
 tcg/ppc/tcg-target-con-str.h | 1 +
 tcg/ppc/tcg-target.h | 4 +-
 tcg/riscv/tcg-target.h | 1 -
 tcg/s390x/tcg-target-con-set.h | 2 +
 tcg/s390x/tcg-target.h | 3 +-
 tcg/sparc64/tcg-target.h | 1 -
 tcg/tci/tcg-target.h | 1 -
 tests/decode/err_field10.decode | 7 +
 tests/decode/err_field7.decode | 7 +
 tests/decode/err_field8.decode | 8 +
 tests/decode/err_field9.decode | 14 ++
 tests/decode/succ_named_field.decode | 19 ++
 tcg/tcg.c | 4 +-
 accel/tcg/ldst_atomicity.c.inc | 80 +------
 tcg/aarch64/tcg-target.c.inc | 243 +++++++++++++++-----
 tcg/i386/tcg-target.c.inc | 191 +++++++++++++++-
 tcg/ppc/tcg-target.c.inc | 108 ++++++++-
 tcg/s390x/tcg-target.c.inc | 107 ++++++++-
 scripts/decodetree.py | 265 ++++++++++++++++++++--
 tests/decode/check.sh | 24 --
 tests/decode/meson.build | 64 ++++++
 tests/meson.build | 5 +-
 38 files changed, 1312 insertions(+), 225 deletions(-)
 create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
 create mode 100644 host/include/aarch64/host/store-insert-al16.h
 create mode 100644 host/include/generic/host/load-extract-al16-al8.h
 create mode 100644 host/include/generic/host/store-insert-al16.h
 create mode 100644 host/include/x86_64/host/atomic128-ldst.h
 create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
 create mode 100644 tests/decode/err_field10.decode
 create mode 100644 tests/decode/err_field7.decode
 create mode 100644 tests/decode/err_field8.decode
 create mode 100644 tests/decode/err_field9.decode
 create mode 100644 tests/decode/succ_named_field.decode
 delete mode 100755 tests/decode/check.sh
 create mode 100644 tests/decode/meson.build
Deleted patch

Case was accidentally dropped in b7a94da9550b.

Tested-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/m68k/translate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index XXXXXXX..XXXXXXX 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -XXX,XX +XXX,XX @@ static void gen_load_fp(DisasContext *s, int opsize, TCGv addr, TCGv_ptr fp,
     switch (opsize) {
     case OS_BYTE:
     case OS_WORD:
+    case OS_LONG:
         tcg_gen_qemu_ld_tl(tmp, addr, index, opsize | MO_SIGN | MO_TE);
         gen_helper_exts32(cpu_env, fp, tmp);
         break;
--
2.34.1
Deleted patch

A copy-paste bug had us looking at the victim cache for writes.

Cc: qemu-stable@nongnu.org
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Fixes: 08dff435e2 ("tcg: Probe the proper permissions for atomic ops")
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20230505204049.352469-1-richard.henderson@linaro.org>
---
 accel/tcg/cputlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index XXXXXXX..XXXXXXX 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -XXX,XX +XXX,XX @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
     } else /* if (prot & PAGE_READ) */ {
         tlb_addr = tlbe->addr_read;
         if (!tlb_hit(tlb_addr, addr)) {
-            if (!VICTIM_TLB_HIT(addr_write, addr)) {
+            if (!VICTIM_TLB_HIT(addr_read, addr)) {
                 tlb_fill(env_cpu(env), addr, size,
                          MMU_DATA_LOAD, mmu_idx, retaddr);
                 index = tlb_index(env, mmu_idx, addr);
--
2.34.1
1
All uses have now been expunged.
1
The first move was incorrectly using TCG_TYPE_I32 while the second
2
move was correctly using TCG_TYPE_REG. This prevents a 64-bit host
3
from moving all 128-bits of the return value.
2
4
3
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
5
Fixes: ebebea53ef8 ("tcg: Support TCG_TYPE_I128 in tcg_out_{ld,st}_helper_{args,ret}")
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
5
---
8
---
6
include/exec/memop.h | 13 ++-----------
9
tcg/tcg.c | 4 ++--
7
include/exec/poison.h | 1 -
10
1 file changed, 2 insertions(+), 2 deletions(-)
8
tcg/tcg.c | 5 -----
9
3 files changed, 2 insertions(+), 17 deletions(-)
10
11
11
diff --git a/include/exec/memop.h b/include/exec/memop.h
12
index XXXXXXX..XXXXXXX 100644
13
--- a/include/exec/memop.h
14
+++ b/include/exec/memop.h
15
@@ -XXX,XX +XXX,XX @@ typedef enum MemOp {
16
* MO_UNALN accesses are never checked for alignment.
17
* MO_ALIGN accesses will result in a call to the CPU's
18
* do_unaligned_access hook if the guest address is not aligned.
19
- * The default depends on whether the target CPU defines
20
- * TARGET_ALIGNED_ONLY.
21
*
22
* Some architectures (e.g. ARMv8) need the address which is aligned
23
* to a size more than the size of the memory access.
24
@@ -XXX,XX +XXX,XX @@ typedef enum MemOp {
25
*/
26
MO_ASHIFT = 5,
27
MO_AMASK = 0x7 << MO_ASHIFT,
28
-#ifdef NEED_CPU_H
29
-#ifdef TARGET_ALIGNED_ONLY
30
- MO_ALIGN = 0,
31
- MO_UNALN = MO_AMASK,
32
-#else
33
- MO_ALIGN = MO_AMASK,
34
- MO_UNALN = 0,
35
-#endif
36
-#endif
37
+ MO_UNALN = 0,
38
MO_ALIGN_2 = 1 << MO_ASHIFT,
39
MO_ALIGN_4 = 2 << MO_ASHIFT,
40
MO_ALIGN_8 = 3 << MO_ASHIFT,
41
MO_ALIGN_16 = 4 << MO_ASHIFT,
42
MO_ALIGN_32 = 5 << MO_ASHIFT,
43
MO_ALIGN_64 = 6 << MO_ASHIFT,
44
+ MO_ALIGN = MO_AMASK,
45
46
/* Combinations of the above, for ease of use. */
47
MO_UB = MO_8,
48
diff --git a/include/exec/poison.h b/include/exec/poison.h
49
index XXXXXXX..XXXXXXX 100644
50
--- a/include/exec/poison.h
51
+++ b/include/exec/poison.h
52
@@ -XXX,XX +XXX,XX @@
53
#pragma GCC poison TARGET_TRICORE
54
#pragma GCC poison TARGET_XTENSA
55
56
-#pragma GCC poison TARGET_ALIGNED_ONLY
57
#pragma GCC poison TARGET_HAS_BFLT
58
#pragma GCC poison TARGET_NAME
59
#pragma GCC poison TARGET_SUPPORTS_MTTCG
60
diff --git a/tcg/tcg.c b/tcg/tcg.c
12
diff --git a/tcg/tcg.c b/tcg/tcg.c
61
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
62
--- a/tcg/tcg.c
14
--- a/tcg/tcg.c
63
+++ b/tcg/tcg.c
15
+++ b/tcg/tcg.c
64
@@ -XXX,XX +XXX,XX @@ static const char * const ldst_name[] =
16
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
65
};
17
mov[0].dst = ldst->datalo_reg;
66
18
mov[0].src =
67
static const char * const alignment_name[(MO_AMASK >> MO_ASHIFT) + 1] = {
19
tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
68
-#ifdef TARGET_ALIGNED_ONLY
20
- mov[0].dst_type = TCG_TYPE_I32;
69
[MO_UNALN >> MO_ASHIFT] = "un+",
21
- mov[0].src_type = TCG_TYPE_I32;
70
- [MO_ALIGN >> MO_ASHIFT] = "",
22
+ mov[0].dst_type = TCG_TYPE_REG;
71
-#else
23
+ mov[0].src_type = TCG_TYPE_REG;
72
- [MO_UNALN >> MO_ASHIFT] = "",
24
mov[0].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
73
[MO_ALIGN >> MO_ASHIFT] = "al+",
25
74
-#endif
26
mov[1].dst = ldst->datahi_reg;
75
[MO_ALIGN_2 >> MO_ASHIFT] = "al2+",
76
[MO_ALIGN_4 >> MO_ASHIFT] = "al4+",
77
[MO_ALIGN_8 >> MO_ASHIFT] = "al8+",
78
--
27
--
79
2.34.1
28
2.34.1
80
81
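With TARGET_ALIGNED_ONLY gone (the include/exec/memop.h hunk above), MO_ALIGN no longer defaults per target: MO_UNALN is now 0 and a front end that needs alignment checking must request it explicitly in the memop. The following is a minimal sketch of the resulting call style, using the same tcg_gen_qemu_ld_tl() API seen in the m68k hunk earlier; the function and its arguments are placeholders, not code from this series.

    /*
     * Sketch only, assuming a target translator where 'dest', 'addr' and
     * 'mem_idx' are already in scope (TCG headers provide the types).
     */
    static void gen_aligned_load_example(TCGv dest, TCGv addr, int mem_idx)
    {
        /* Alignment checking is now requested per access... */
        tcg_gen_qemu_ld_tl(dest, addr, mem_idx, MO_TEUL | MO_ALIGN);

        /* ...and an unaligned access simply omits MO_ALIGN (MO_UNALN == 0). */
        tcg_gen_qemu_ld_tl(dest, addr, mem_idx, MO_TEUL);
    }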
1
This header is supposed to be private to tcg and in fact
1
PAGE_WRITE is current writability, as modified by TB protection;
2
does not need to be included here at all.
2
PAGE_WRITE_ORG is the original page writability.
3
3
4
Reviewed-by: Song Gao <gaosong@loongson.cn>
4
Fixes: cdfac37be0d ("accel/tcg: Honor atomicity of loads")
5
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
5
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
7
---
8
target/loongarch/csr_helper.c | 1 -
8
accel/tcg/ldst_atomicity.c.inc | 4 ++--
9
target/loongarch/iocsr_helper.c | 1 -
9
1 file changed, 2 insertions(+), 2 deletions(-)
10
2 files changed, 2 deletions(-)
11
10
12
diff --git a/target/loongarch/csr_helper.c b/target/loongarch/csr_helper.c
11
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
13
index XXXXXXX..XXXXXXX 100644
12
index XXXXXXX..XXXXXXX 100644
14
--- a/target/loongarch/csr_helper.c
13
--- a/accel/tcg/ldst_atomicity.c.inc
15
+++ b/target/loongarch/csr_helper.c
14
+++ b/accel/tcg/ldst_atomicity.c.inc
16
@@ -XXX,XX +XXX,XX @@
15
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
17
#include "exec/cpu_ldst.h"
16
* another process, because the fallback start_exclusive solution
18
#include "hw/irq.h"
17
* provides no protection across processes.
19
#include "cpu-csr.h"
18
*/
20
-#include "tcg/tcg-ldst.h"
19
- if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
21
20
+ if (!page_check_range(h2g(pv), 8, PAGE_WRITE_ORG)) {
22
target_ulong helper_csrrd_pgd(CPULoongArchState *env)
21
uint64_t *p = __builtin_assume_aligned(pv, 8);
23
{
22
return *p;
24
diff --git a/target/loongarch/iocsr_helper.c b/target/loongarch/iocsr_helper.c
23
}
25
index XXXXXXX..XXXXXXX 100644
24
@@ -XXX,XX +XXX,XX @@ static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
26
--- a/target/loongarch/iocsr_helper.c
25
* another process, because the fallback start_exclusive solution
27
+++ b/target/loongarch/iocsr_helper.c
26
* provides no protection across processes.
28
@@ -XXX,XX +XXX,XX @@
27
*/
29
#include "exec/helper-proto.h"
28
- if (!page_check_range(h2g(p), 16, PAGE_WRITE)) {
30
#include "exec/exec-all.h"
29
+ if (!page_check_range(h2g(p), 16, PAGE_WRITE_ORG)) {
31
#include "exec/cpu_ldst.h"
30
return *p;
32
-#include "tcg/tcg-ldst.h"
31
}
33
32
#endif
34
#define GET_MEMTXATTRS(cas) \
35
((MemTxAttrs){.requester_id = env_cpu(cas)->cpu_index})
36
--
33
--
37
2.34.1
34
2.34.1
1
Reviewed-by: Thomas Huth <thuth@redhat.com>
1
Older versions of clang have missing runtime functions for arithmetic
2
with -fsanitize=undefined (see 464e3671f9d5c), so we cannot use
3
__int128_t for implementing Int128. But __int128_t is present,
4
data movement works, and it can be used for atomic128.
5
6
Probe for both CONFIG_INT128_TYPE and CONFIG_INT128, adjust
7
qemu/int128.h to define Int128Alias if CONFIG_INT128_TYPE,
8
and adjust the meson probe for atomics to use has_int128_type.
9
10
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
11
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Message-Id: <20230503072331.1747057-80-richard.henderson@linaro.org>
4
---
12
---
5
meson.build | 3 ---
13
meson.build | 15 ++++++++++-----
6
disas.c => disas/disas.c | 0
14
include/qemu/int128.h | 4 ++--
7
disas/meson.build | 4 +++-
15
2 files changed, 12 insertions(+), 7 deletions(-)
8
3 files changed, 3 insertions(+), 4 deletions(-)
9
rename disas.c => disas/disas.c (100%)
10
16
11
diff --git a/meson.build b/meson.build
17
diff --git a/meson.build b/meson.build
12
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
13
--- a/meson.build
19
--- a/meson.build
14
+++ b/meson.build
20
+++ b/meson.build
15
@@ -XXX,XX +XXX,XX @@ specific_ss.add(files('cpu.c'))
21
@@ -XXX,XX +XXX,XX @@ config_host_data.set('CONFIG_ATOMIC64', cc.links('''
16
22
return 0;
17
subdir('softmmu')
23
}'''))
18
24
19
-common_ss.add(capstone)
25
-has_int128 = cc.links('''
20
-specific_ss.add(files('disas.c'), capstone)
26
+has_int128_type = cc.compiles('''
27
+ __int128_t a;
28
+ __uint128_t b;
29
+ int main(void) { b = a; }''')
30
+config_host_data.set('CONFIG_INT128_TYPE', has_int128_type)
31
+
32
+has_int128 = has_int128_type and cc.links('''
33
__int128_t a;
34
__uint128_t b;
35
int main (void) {
36
@@ -XXX,XX +XXX,XX @@ has_int128 = cc.links('''
37
a = a * a;
38
return 0;
39
}''')
21
-
40
-
22
# Work around a gcc bug/misfeature wherein constant propagation looks
41
config_host_data.set('CONFIG_INT128', has_int128)
23
# through an alias:
42
24
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99696
43
-if has_int128
25
diff --git a/disas.c b/disas/disas.c
44
+if has_int128_type
26
similarity index 100%
45
# "do we have 128-bit atomics which are handled inline and specifically not
27
rename from disas.c
46
# via libatomic". The reason we can't use libatomic is documented in the
28
rename to disas/disas.c
47
# comment starting "GCC is a house divided" in include/qemu/atomic128.h.
29
diff --git a/disas/meson.build b/disas/meson.build
48
@@ -XXX,XX +XXX,XX @@ if has_int128
49
# __alignof(unsigned __int128) for the host.
50
atomic_test_128 = '''
51
int main(int ac, char **av) {
52
- unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);
53
+ __uint128_t *p = __builtin_assume_aligned(av[ac - 1], 16);
54
p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
55
__atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
56
__atomic_compare_exchange_n(&p[4], &p[5], p[6], 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
57
@@ -XXX,XX +XXX,XX @@ if has_int128
58
config_host_data.set('CONFIG_CMPXCHG128', cc.links('''
59
int main(void)
60
{
61
- unsigned __int128 x = 0, y = 0;
62
+ __uint128_t x = 0, y = 0;
63
__sync_val_compare_and_swap_16(&x, y, x);
64
return 0;
65
}
66
diff --git a/include/qemu/int128.h b/include/qemu/int128.h
30
index XXXXXXX..XXXXXXX 100644
67
index XXXXXXX..XXXXXXX 100644
31
--- a/disas/meson.build
68
--- a/include/qemu/int128.h
32
+++ b/disas/meson.build
69
+++ b/include/qemu/int128.h
33
@@ -XXX,XX +XXX,XX @@ common_ss.add(when: 'CONFIG_RISCV_DIS', if_true: files('riscv.c'))
70
@@ -XXX,XX +XXX,XX @@ static inline void bswap128s(Int128 *s)
34
common_ss.add(when: 'CONFIG_SH4_DIS', if_true: files('sh4.c'))
71
* a possible structure and the native types. Ease parameter passing
35
common_ss.add(when: 'CONFIG_SPARC_DIS', if_true: files('sparc.c'))
72
* via use of the transparent union extension.
36
common_ss.add(when: 'CONFIG_XTENSA_DIS', if_true: files('xtensa.c'))
73
*/
37
-common_ss.add(when: capstone, if_true: files('capstone.c'))
74
-#ifdef CONFIG_INT128
38
+common_ss.add(when: capstone, if_true: [files('capstone.c'), capstone])
75
+#ifdef CONFIG_INT128_TYPE
39
+
76
typedef union {
40
+specific_ss.add(files('disas.c'), capstone)
77
__uint128_t u;
78
__int128_t i;
79
@@ -XXX,XX +XXX,XX @@ typedef union {
80
} Int128Alias __attribute__((transparent_union));
81
#else
82
typedef Int128 Int128Alias;
83
-#endif /* CONFIG_INT128 */
84
+#endif /* CONFIG_INT128_TYPE */
85
86
#endif /* INT128_H */
41
--
87
--
42
2.34.1
88
2.34.1
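The distinction the meson hunk above draws between CONFIG_INT128_TYPE and CONFIG_INT128 comes down to two separate compiler capabilities. Below is a stand-alone C sketch mirroring the two probe programs; it is illustrative only and not code from the tree.

    /*
     * CONFIG_INT128_TYPE: __int128_t/__uint128_t exist and data movement
     * works, which is all Int128Alias and the atomic128 helpers need.
     * CONFIG_INT128: full arithmetic also compiles and links, which older
     * clang with -fsanitize=undefined cannot guarantee (missing runtime
     * functions, see 464e3671f9d5c).
     */
    __int128_t a;
    __uint128_t b;

    int type_and_movement_only(void)    /* enough for CONFIG_INT128_TYPE */
    {
        b = a;
        return 0;
    }

    int full_arithmetic(void)           /* additionally needed for CONFIG_INT128 */
    {
        a = a * a;                      /* arithmetic may need runtime support */
        a -= b;
        return 0;
    }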
1
Add tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
1
With CPUINFO_ATOMIC_VMOVDQA, we can perform proper atomic
2
and tcg_out_st_helper_args. These and their subroutines
2
load/store without cmpxchg16b.
3
use the existing knowledge of the host function call abi
4
to load the function call arguments and return results.
5
6
These will be used to simplify the backends in turn.
7
3
8
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
9
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
10
---
6
---
11
tcg/tcg.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
7
host/include/x86_64/host/atomic128-ldst.h | 68 +++++++++++++++++++++++
12
1 file changed, 471 insertions(+), 4 deletions(-)
8
1 file changed, 68 insertions(+)
9
create mode 100644 host/include/x86_64/host/atomic128-ldst.h
13
10
14
diff --git a/tcg/tcg.c b/tcg/tcg.c
11
diff --git a/host/include/x86_64/host/atomic128-ldst.h b/host/include/x86_64/host/atomic128-ldst.h
15
index XXXXXXX..XXXXXXX 100644
12
new file mode 100644
16
--- a/tcg/tcg.c
13
index XXXXXXX..XXXXXXX
17
+++ b/tcg/tcg.c
14
--- /dev/null
18
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct);
15
+++ b/host/include/x86_64/host/atomic128-ldst.h
19
static int tcg_out_ldst_finalize(TCGContext *s);
16
@@ -XXX,XX +XXX,XX @@
20
#endif
21
22
+typedef struct TCGLdstHelperParam {
23
+ TCGReg (*ra_gen)(TCGContext *s, const TCGLabelQemuLdst *l, int arg_reg);
24
+ unsigned ntmp;
25
+ int tmp[3];
26
+} TCGLdstHelperParam;
27
+
28
+static void tcg_out_ld_helper_args(TCGContext *s, const TCGLabelQemuLdst *l,
29
+ const TCGLdstHelperParam *p)
30
+ __attribute__((unused));
31
+static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *l,
32
+ bool load_sign, const TCGLdstHelperParam *p)
33
+ __attribute__((unused));
34
+static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *l,
35
+ const TCGLdstHelperParam *p)
36
+ __attribute__((unused));
37
+
38
TCGContext tcg_init_ctx;
39
__thread TCGContext *tcg_ctx;
40
41
@@ -XXX,XX +XXX,XX @@ void tcg_raise_tb_overflow(TCGContext *s)
42
siglongjmp(s->jmp_trans, -2);
43
}
44
45
+/*
17
+/*
46
+ * Used by tcg_out_movext{1,2} to hold the arguments for tcg_out_movext.
18
+ * SPDX-License-Identifier: GPL-2.0-or-later
47
+ * By the time we arrive at tcg_out_movext1, @dst is always a TCGReg.
19
+ * Load/store for 128-bit atomic operations, x86_64 version.
48
+ *
20
+ *
49
+ * However, tcg_out_helper_load_slots reuses this field to hold an
21
+ * Copyright (C) 2023 Linaro, Ltd.
50
+ * argument slot number (which may designate a argument register or an
22
+ *
51
+ * argument stack slot), converting to TCGReg once all arguments that
23
+ * See docs/devel/atomics.rst for discussion about the guarantees each
52
+ * are destined for the stack are processed.
24
+ * atomic primitive is meant to provide.
53
+ */
54
typedef struct TCGMovExtend {
55
- TCGReg dst;
56
+ unsigned dst;
57
TCGReg src;
58
TCGType dst_type;
59
TCGType src_type;
60
@@ -XXX,XX +XXX,XX @@ static void tcg_out_movext1(TCGContext *s, const TCGMovExtend *i)
61
* between the sources and destinations.
62
*/
63
64
-static void __attribute__((unused))
65
-tcg_out_movext2(TCGContext *s, const TCGMovExtend *i1,
66
- const TCGMovExtend *i2, int scratch)
67
+static void tcg_out_movext2(TCGContext *s, const TCGMovExtend *i1,
68
+ const TCGMovExtend *i2, int scratch)
69
{
70
TCGReg src1 = i1->src;
71
TCGReg src2 = i2->src;
72
@@ -XXX,XX +XXX,XX @@ static TCGHelperInfo all_helpers[] = {
73
};
74
static GHashTable *helper_table;
75
76
+/*
77
+ * Create TCGHelperInfo structures for "tcg/tcg-ldst.h" functions,
78
+ * akin to what "exec/helper-tcg.h" does with DEF_HELPER_FLAGS_N.
79
+ * We only use these for layout in tcg_out_ld_helper_ret and
80
+ * tcg_out_st_helper_args, and share them between several of
81
+ * the helpers, with the end result that it's easier to build manually.
82
+ */
25
+ */
83
+
26
+
84
+#if TCG_TARGET_REG_BITS == 32
27
+#ifndef AARCH64_ATOMIC128_LDST_H
85
+# define dh_typecode_ttl dh_typecode_i32
28
+#define AARCH64_ATOMIC128_LDST_H
29
+
30
+#ifdef CONFIG_INT128_TYPE
31
+#include "host/cpuinfo.h"
32
+#include "tcg/debug-assert.h"
33
+
34
+/*
35
+ * Through clang 16, with -mcx16, __atomic_load_n is incorrectly
36
+ * expanded to a read-write operation: lock cmpxchg16b.
37
+ */
38
+
39
+#define HAVE_ATOMIC128_RO likely(cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
40
+#define HAVE_ATOMIC128_RW 1
41
+
42
+static inline Int128 atomic16_read_ro(const Int128 *ptr)
43
+{
44
+ Int128Alias r;
45
+
46
+ tcg_debug_assert(HAVE_ATOMIC128_RO);
47
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr));
48
+
49
+ return r.s;
50
+}
51
+
52
+static inline Int128 atomic16_read_rw(Int128 *ptr)
53
+{
54
+ __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
55
+ Int128Alias r;
56
+
57
+ if (HAVE_ATOMIC128_RO) {
58
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
59
+ } else {
60
+ r.i = __sync_val_compare_and_swap_16(ptr_align, 0, 0);
61
+ }
62
+ return r.s;
63
+}
64
+
65
+static inline void atomic16_set(Int128 *ptr, Int128 val)
66
+{
67
+ __int128_t *ptr_align = __builtin_assume_aligned(ptr, 16);
68
+ Int128Alias new = { .s = val };
69
+
70
+ if (HAVE_ATOMIC128_RO) {
71
+ asm("vmovdqa %1, %0" : "=m"(*ptr_align) : "x" (new.i));
72
+ } else {
73
+ __int128_t old;
74
+ do {
75
+ old = *ptr_align;
76
+ } while (!__sync_bool_compare_and_swap_16(ptr_align, old, new.i));
77
+ }
78
+}
86
+#else
79
+#else
87
+# define dh_typecode_ttl dh_typecode_i64
80
+/* Provide QEMU_ERROR stubs. */
81
+#include "host/include/generic/host/atomic128-ldst.h"
88
+#endif
82
+#endif
89
+
83
+
90
+static TCGHelperInfo info_helper_ld32_mmu = {
84
+#endif /* AARCH64_ATOMIC128_LDST_H */
91
+ .flags = TCG_CALL_NO_WG,
92
+ .typemask = dh_typemask(ttl, 0) /* return tcg_target_ulong */
93
+ | dh_typemask(env, 1)
94
+ | dh_typemask(tl, 2) /* target_ulong addr */
95
+ | dh_typemask(i32, 3) /* unsigned oi */
96
+ | dh_typemask(ptr, 4) /* uintptr_t ra */
97
+};
98
+
99
+static TCGHelperInfo info_helper_ld64_mmu = {
100
+ .flags = TCG_CALL_NO_WG,
101
+ .typemask = dh_typemask(i64, 0) /* return uint64_t */
102
+ | dh_typemask(env, 1)
103
+ | dh_typemask(tl, 2) /* target_ulong addr */
104
+ | dh_typemask(i32, 3) /* unsigned oi */
105
+ | dh_typemask(ptr, 4) /* uintptr_t ra */
106
+};
107
+
108
+static TCGHelperInfo info_helper_st32_mmu = {
109
+ .flags = TCG_CALL_NO_WG,
110
+ .typemask = dh_typemask(void, 0)
111
+ | dh_typemask(env, 1)
112
+ | dh_typemask(tl, 2) /* target_ulong addr */
113
+ | dh_typemask(i32, 3) /* uint32_t data */
114
+ | dh_typemask(i32, 4) /* unsigned oi */
115
+ | dh_typemask(ptr, 5) /* uintptr_t ra */
116
+};
117
+
118
+static TCGHelperInfo info_helper_st64_mmu = {
119
+ .flags = TCG_CALL_NO_WG,
120
+ .typemask = dh_typemask(void, 0)
121
+ | dh_typemask(env, 1)
122
+ | dh_typemask(tl, 2) /* target_ulong addr */
123
+ | dh_typemask(i64, 3) /* uint64_t data */
124
+ | dh_typemask(i32, 4) /* unsigned oi */
125
+ | dh_typemask(ptr, 5) /* uintptr_t ra */
126
+};
127
+
128
#ifdef CONFIG_TCG_INTERPRETER
129
static ffi_type *typecode_to_ffi(int argmask)
130
{
131
@@ -XXX,XX +XXX,XX @@ static void tcg_context_init(unsigned max_cpus)
132
(gpointer)&all_helpers[i]);
133
}
134
135
+ init_call_layout(&info_helper_ld32_mmu);
136
+ init_call_layout(&info_helper_ld64_mmu);
137
+ init_call_layout(&info_helper_st32_mmu);
138
+ init_call_layout(&info_helper_st64_mmu);
139
+
140
#ifdef CONFIG_TCG_INTERPRETER
141
init_ffi_layouts();
142
#endif
143
@@ -XXX,XX +XXX,XX @@ static void tcg_reg_alloc_call(TCGContext *s, TCGOp *op)
144
}
145
}
146
147
+/*
148
+ * Similarly for qemu_ld/st slow path helpers.
149
+ * We must re-implement tcg_gen_callN and tcg_reg_alloc_call simultaneously,
150
+ * using only the provided backend tcg_out_* functions.
151
+ */
152
+
153
+static int tcg_out_helper_stk_ofs(TCGType type, unsigned slot)
154
+{
155
+ int ofs = arg_slot_stk_ofs(slot);
156
+
157
+ /*
158
+ * Each stack slot is TCG_TARGET_LONG_BITS. If the host does not
159
+ * require extension to uint64_t, adjust the address for uint32_t.
160
+ */
161
+ if (HOST_BIG_ENDIAN &&
162
+ TCG_TARGET_REG_BITS == 64 &&
163
+ type == TCG_TYPE_I32) {
164
+ ofs += 4;
165
+ }
166
+ return ofs;
167
+}
168
+
169
+static void tcg_out_helper_load_regs(TCGContext *s,
170
+ unsigned nmov, TCGMovExtend *mov,
171
+ unsigned ntmp, const int *tmp)
172
+{
173
+ switch (nmov) {
174
+ default:
175
+ /* The backend must have provided enough temps for the worst case. */
176
+ tcg_debug_assert(ntmp + 1 >= nmov);
177
+
178
+ for (unsigned i = nmov - 1; i >= 2; --i) {
179
+ TCGReg dst = mov[i].dst;
180
+
181
+ for (unsigned j = 0; j < i; ++j) {
182
+ if (dst == mov[j].src) {
183
+ /*
184
+ * Conflict.
185
+ * Copy the source to a temporary, recurse for the
186
+ * remaining moves, perform the extension from our
187
+ * scratch on the way out.
188
+ */
189
+ TCGReg scratch = tmp[--ntmp];
190
+ tcg_out_mov(s, mov[i].src_type, scratch, mov[i].src);
191
+ mov[i].src = scratch;
192
+
193
+ tcg_out_helper_load_regs(s, i, mov, ntmp, tmp);
194
+ tcg_out_movext1(s, &mov[i]);
195
+ return;
196
+ }
197
+ }
198
+
199
+ /* No conflicts: perform this move and continue. */
200
+ tcg_out_movext1(s, &mov[i]);
201
+ }
202
+ /* fall through for the final two moves */
203
+
204
+ case 2:
205
+ tcg_out_movext2(s, mov, mov + 1, ntmp ? tmp[0] : -1);
206
+ return;
207
+ case 1:
208
+ tcg_out_movext1(s, mov);
209
+ return;
210
+ case 0:
211
+ g_assert_not_reached();
212
+ }
213
+}
214
+
215
+static void tcg_out_helper_load_slots(TCGContext *s,
216
+ unsigned nmov, TCGMovExtend *mov,
217
+ const TCGLdstHelperParam *parm)
218
+{
219
+ unsigned i;
220
+
221
+ /*
222
+ * Start from the end, storing to the stack first.
223
+ * This frees those registers, so we need not consider overlap.
224
+ */
225
+ for (i = nmov; i-- > 0; ) {
226
+ unsigned slot = mov[i].dst;
227
+
228
+ if (arg_slot_reg_p(slot)) {
229
+ goto found_reg;
230
+ }
231
+
232
+ TCGReg src = mov[i].src;
233
+ TCGType dst_type = mov[i].dst_type;
234
+ MemOp dst_mo = dst_type == TCG_TYPE_I32 ? MO_32 : MO_64;
235
+
236
+ /* The argument is going onto the stack; extend into scratch. */
237
+ if ((mov[i].src_ext & MO_SIZE) != dst_mo) {
238
+ tcg_debug_assert(parm->ntmp != 0);
239
+ mov[i].dst = src = parm->tmp[0];
240
+ tcg_out_movext1(s, &mov[i]);
241
+ }
242
+
243
+ tcg_out_st(s, dst_type, src, TCG_REG_CALL_STACK,
244
+ tcg_out_helper_stk_ofs(dst_type, slot));
245
+ }
246
+ return;
247
+
248
+ found_reg:
249
+ /*
250
+ * The remaining arguments are in registers.
251
+ * Convert slot numbers to argument registers.
252
+ */
253
+ nmov = i + 1;
254
+ for (i = 0; i < nmov; ++i) {
255
+ mov[i].dst = tcg_target_call_iarg_regs[mov[i].dst];
256
+ }
257
+ tcg_out_helper_load_regs(s, nmov, mov, parm->ntmp, parm->tmp);
258
+}
259
+
260
+static void tcg_out_helper_load_imm(TCGContext *s, unsigned slot,
261
+ TCGType type, tcg_target_long imm,
262
+ const TCGLdstHelperParam *parm)
263
+{
264
+ if (arg_slot_reg_p(slot)) {
265
+ tcg_out_movi(s, type, tcg_target_call_iarg_regs[slot], imm);
266
+ } else {
267
+ int ofs = tcg_out_helper_stk_ofs(type, slot);
268
+ if (!tcg_out_sti(s, type, imm, TCG_REG_CALL_STACK, ofs)) {
269
+ tcg_debug_assert(parm->ntmp != 0);
270
+ tcg_out_movi(s, type, parm->tmp[0], imm);
271
+ tcg_out_st(s, type, parm->tmp[0], TCG_REG_CALL_STACK, ofs);
272
+ }
273
+ }
274
+}
275
+
276
+static void tcg_out_helper_load_common_args(TCGContext *s,
277
+ const TCGLabelQemuLdst *ldst,
278
+ const TCGLdstHelperParam *parm,
279
+ const TCGHelperInfo *info,
280
+ unsigned next_arg)
281
+{
282
+ TCGMovExtend ptr_mov = {
283
+ .dst_type = TCG_TYPE_PTR,
284
+ .src_type = TCG_TYPE_PTR,
285
+ .src_ext = sizeof(void *) == 4 ? MO_32 : MO_64
286
+ };
287
+ const TCGCallArgumentLoc *loc = &info->in[0];
288
+ TCGType type;
289
+ unsigned slot;
290
+ tcg_target_ulong imm;
291
+
292
+ /*
293
+ * Handle env, which is always first.
294
+ */
295
+ ptr_mov.dst = loc->arg_slot;
296
+ ptr_mov.src = TCG_AREG0;
297
+ tcg_out_helper_load_slots(s, 1, &ptr_mov, parm);
298
+
299
+ /*
300
+ * Handle oi.
301
+ */
302
+ imm = ldst->oi;
303
+ loc = &info->in[next_arg];
304
+ type = TCG_TYPE_I32;
305
+ switch (loc->kind) {
306
+ case TCG_CALL_ARG_NORMAL:
307
+ break;
308
+ case TCG_CALL_ARG_EXTEND_U:
309
+ case TCG_CALL_ARG_EXTEND_S:
310
+ /* No extension required for MemOpIdx. */
311
+ tcg_debug_assert(imm <= INT32_MAX);
312
+ type = TCG_TYPE_REG;
313
+ break;
314
+ default:
315
+ g_assert_not_reached();
316
+ }
317
+ tcg_out_helper_load_imm(s, loc->arg_slot, type, imm, parm);
318
+ next_arg++;
319
+
320
+ /*
321
+ * Handle ra.
322
+ */
323
+ loc = &info->in[next_arg];
324
+ slot = loc->arg_slot;
325
+ if (parm->ra_gen) {
326
+ int arg_reg = -1;
327
+ TCGReg ra_reg;
328
+
329
+ if (arg_slot_reg_p(slot)) {
330
+ arg_reg = tcg_target_call_iarg_regs[slot];
331
+ }
332
+ ra_reg = parm->ra_gen(s, ldst, arg_reg);
333
+
334
+ ptr_mov.dst = slot;
335
+ ptr_mov.src = ra_reg;
336
+ tcg_out_helper_load_slots(s, 1, &ptr_mov, parm);
337
+ } else {
338
+ imm = (uintptr_t)ldst->raddr;
339
+ tcg_out_helper_load_imm(s, slot, TCG_TYPE_PTR, imm, parm);
340
+ }
341
+}
342
+
343
+static unsigned tcg_out_helper_add_mov(TCGMovExtend *mov,
344
+ const TCGCallArgumentLoc *loc,
345
+ TCGType dst_type, TCGType src_type,
346
+ TCGReg lo, TCGReg hi)
347
+{
348
+ if (dst_type <= TCG_TYPE_REG) {
349
+ MemOp src_ext;
350
+
351
+ switch (loc->kind) {
352
+ case TCG_CALL_ARG_NORMAL:
353
+ src_ext = src_type == TCG_TYPE_I32 ? MO_32 : MO_64;
354
+ break;
355
+ case TCG_CALL_ARG_EXTEND_U:
356
+ dst_type = TCG_TYPE_REG;
357
+ src_ext = MO_UL;
358
+ break;
359
+ case TCG_CALL_ARG_EXTEND_S:
360
+ dst_type = TCG_TYPE_REG;
361
+ src_ext = MO_SL;
362
+ break;
363
+ default:
364
+ g_assert_not_reached();
365
+ }
366
+
367
+ mov[0].dst = loc->arg_slot;
368
+ mov[0].dst_type = dst_type;
369
+ mov[0].src = lo;
370
+ mov[0].src_type = src_type;
371
+ mov[0].src_ext = src_ext;
372
+ return 1;
373
+ }
374
+
375
+ assert(TCG_TARGET_REG_BITS == 32);
376
+
377
+ mov[0].dst = loc[HOST_BIG_ENDIAN].arg_slot;
378
+ mov[0].src = lo;
379
+ mov[0].dst_type = TCG_TYPE_I32;
380
+ mov[0].src_type = TCG_TYPE_I32;
381
+ mov[0].src_ext = MO_32;
382
+
383
+ mov[1].dst = loc[!HOST_BIG_ENDIAN].arg_slot;
384
+ mov[1].src = hi;
385
+ mov[1].dst_type = TCG_TYPE_I32;
386
+ mov[1].src_type = TCG_TYPE_I32;
387
+ mov[1].src_ext = MO_32;
388
+
389
+ return 2;
390
+}
391
+
392
+static void tcg_out_ld_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
393
+ const TCGLdstHelperParam *parm)
394
+{
395
+ const TCGHelperInfo *info;
396
+ const TCGCallArgumentLoc *loc;
397
+ TCGMovExtend mov[2];
398
+ unsigned next_arg, nmov;
399
+ MemOp mop = get_memop(ldst->oi);
400
+
401
+ switch (mop & MO_SIZE) {
402
+ case MO_8:
403
+ case MO_16:
404
+ case MO_32:
405
+ info = &info_helper_ld32_mmu;
406
+ break;
407
+ case MO_64:
408
+ info = &info_helper_ld64_mmu;
409
+ break;
410
+ default:
411
+ g_assert_not_reached();
412
+ }
413
+
414
+ /* Defer env argument. */
415
+ next_arg = 1;
416
+
417
+ loc = &info->in[next_arg];
418
+ nmov = tcg_out_helper_add_mov(mov, loc, TCG_TYPE_TL, TCG_TYPE_TL,
419
+ ldst->addrlo_reg, ldst->addrhi_reg);
420
+ next_arg += nmov;
421
+
422
+ tcg_out_helper_load_slots(s, nmov, mov, parm);
423
+
424
+ /* No special attention for 32 and 64-bit return values. */
425
+ tcg_debug_assert(info->out_kind == TCG_CALL_RET_NORMAL);
426
+
427
+ tcg_out_helper_load_common_args(s, ldst, parm, info, next_arg);
428
+}
429
+
430
+static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
431
+ bool load_sign,
432
+ const TCGLdstHelperParam *parm)
433
+{
434
+ TCGMovExtend mov[2];
435
+
436
+ if (ldst->type <= TCG_TYPE_REG) {
437
+ MemOp mop = get_memop(ldst->oi);
438
+
439
+ mov[0].dst = ldst->datalo_reg;
440
+ mov[0].src = tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, 0);
441
+ mov[0].dst_type = ldst->type;
442
+ mov[0].src_type = TCG_TYPE_REG;
443
+
444
+ /*
445
+ * If load_sign, then we allowed the helper to perform the
446
+ * appropriate sign extension to tcg_target_ulong, and all
447
+ * we need now is a plain move.
448
+ *
449
+ * If they do not, then we expect the relevant extension
450
+ * instruction to be no more expensive than a move, and
451
+ * we thus save the icache etc by only using one of two
452
+ * helper functions.
453
+ */
454
+ if (load_sign || !(mop & MO_SIGN)) {
455
+ if (TCG_TARGET_REG_BITS == 32 || ldst->type == TCG_TYPE_I32) {
456
+ mov[0].src_ext = MO_32;
457
+ } else {
458
+ mov[0].src_ext = MO_64;
459
+ }
460
+ } else {
461
+ mov[0].src_ext = mop & MO_SSIZE;
462
+ }
463
+ tcg_out_movext1(s, mov);
464
+ } else {
465
+ assert(TCG_TARGET_REG_BITS == 32);
466
+
467
+ mov[0].dst = ldst->datalo_reg;
468
+ mov[0].src =
469
+ tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
470
+ mov[0].dst_type = TCG_TYPE_I32;
471
+ mov[0].src_type = TCG_TYPE_I32;
472
+ mov[0].src_ext = MO_32;
473
+
474
+ mov[1].dst = ldst->datahi_reg;
475
+ mov[1].src =
476
+ tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, !HOST_BIG_ENDIAN);
477
+ mov[1].dst_type = TCG_TYPE_REG;
478
+ mov[1].src_type = TCG_TYPE_REG;
479
+ mov[1].src_ext = MO_32;
480
+
481
+ tcg_out_movext2(s, mov, mov + 1, parm->ntmp ? parm->tmp[0] : -1);
482
+ }
483
+}
484
+
485
+static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
486
+ const TCGLdstHelperParam *parm)
487
+{
488
+ const TCGHelperInfo *info;
489
+ const TCGCallArgumentLoc *loc;
490
+ TCGMovExtend mov[4];
491
+ TCGType data_type;
492
+ unsigned next_arg, nmov, n;
493
+ MemOp mop = get_memop(ldst->oi);
494
+
495
+ switch (mop & MO_SIZE) {
496
+ case MO_8:
497
+ case MO_16:
498
+ case MO_32:
499
+ info = &info_helper_st32_mmu;
500
+ data_type = TCG_TYPE_I32;
501
+ break;
502
+ case MO_64:
503
+ info = &info_helper_st64_mmu;
504
+ data_type = TCG_TYPE_I64;
505
+ break;
506
+ default:
507
+ g_assert_not_reached();
508
+ }
509
+
510
+ /* Defer env argument. */
511
+ next_arg = 1;
512
+ nmov = 0;
513
+
514
+ /* Handle addr argument. */
515
+ loc = &info->in[next_arg];
516
+ n = tcg_out_helper_add_mov(mov, loc, TCG_TYPE_TL, TCG_TYPE_TL,
517
+ ldst->addrlo_reg, ldst->addrhi_reg);
518
+ next_arg += n;
519
+ nmov += n;
520
+
521
+ /* Handle data argument. */
522
+ loc = &info->in[next_arg];
523
+ n = tcg_out_helper_add_mov(mov + nmov, loc, data_type, ldst->type,
524
+ ldst->datalo_reg, ldst->datahi_reg);
525
+ next_arg += n;
526
+ nmov += n;
527
+ tcg_debug_assert(nmov <= ARRAY_SIZE(mov));
528
+
529
+ tcg_out_helper_load_slots(s, nmov, mov, parm);
530
+ tcg_out_helper_load_common_args(s, ldst, parm, info, next_arg);
531
+}
532
+
533
#ifdef CONFIG_PROFILER
534
535
/* avoid copy/paste errors */
536
--
85
--
537
2.34.1
86
2.34.1
538
87
539
88
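The atomic128-ldst.h header added above exposes atomic16_read_ro(), atomic16_read_rw() and atomic16_set(), selected at run time via HAVE_ATOMIC128_RO. A small usage sketch follows; the wrapper function is illustrative only, and the include of "qemu/atomic128.h" as the entry point is an assumption rather than something shown in this series.

    #include "qemu/osdep.h"
    #include "qemu/int128.h"
    #include "qemu/atomic128.h"

    /*
     * Illustrative helper, not part of the series: copy 16 bytes atomically,
     * preferring the read-only load (VMOVDQA) when the CPU provides it and
     * falling back to the read-write, cmpxchg16b-based form otherwise.
     */
    static Int128 copy16_atomic(Int128 *dst, Int128 *src)
    {
        Int128 val = HAVE_ATOMIC128_RO ? atomic16_read_ro(src)
                                       : atomic16_read_rw(src);
        atomic16_set(dst, val);
        return val;
    }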
1
Use tcg_out_ld_helper_args and tcg_out_ld_helper_ret.
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
3
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
---
3
---
6
tcg/i386/tcg-target.c.inc | 71 +++++++++++++++------------------------
4
tcg/i386/tcg-target.h | 4 +-
7
1 file changed, 28 insertions(+), 43 deletions(-)
5
tcg/i386/tcg-target.c.inc | 191 +++++++++++++++++++++++++++++++++++++-
6
2 files changed, 190 insertions(+), 5 deletions(-)
8
7
8
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
9
index XXXXXXX..XXXXXXX 100644
10
--- a/tcg/i386/tcg-target.h
11
+++ b/tcg/i386/tcg-target.h
12
@@ -XXX,XX +XXX,XX @@ typedef enum {
13
#define have_avx1 (cpuinfo & CPUINFO_AVX1)
14
#define have_avx2 (cpuinfo & CPUINFO_AVX2)
15
#define have_movbe (cpuinfo & CPUINFO_MOVBE)
16
-#define have_atomic16 (cpuinfo & CPUINFO_ATOMIC_VMOVDQA)
17
18
/*
19
* There are interesting instructions in AVX512, so long as we have AVX512VL,
20
@@ -XXX,XX +XXX,XX @@ typedef enum {
21
#define TCG_TARGET_HAS_qemu_st8_i32 1
22
#endif
23
24
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
25
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
26
+ (TCG_TARGET_REG_BITS == 64 && (cpuinfo & CPUINFO_ATOMIC_VMOVDQA))
27
28
/* We do not support older SSE systems, only beginning with AVX1. */
29
#define TCG_TARGET_HAS_v64 have_avx1
9
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
30
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
10
index XXXXXXX..XXXXXXX 100644
31
index XXXXXXX..XXXXXXX 100644
11
--- a/tcg/i386/tcg-target.c.inc
32
--- a/tcg/i386/tcg-target.c.inc
12
+++ b/tcg/i386/tcg-target.c.inc
33
+++ b/tcg/i386/tcg-target.c.inc
13
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
34
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
14
[MO_BEUQ] = helper_be_stq_mmu,
35
#endif
15
};
36
};
16
37
17
+/*
38
+#define TCG_TMP_VEC TCG_REG_XMM5
18
+ * Because i686 has no register parameters and because x86_64 has xchg
39
+
19
+ * to handle addr/data register overlap, we have placed all input arguments
40
static const int tcg_target_call_iarg_regs[] = {
20
+ * before we need might need a scratch reg.
41
#if TCG_TARGET_REG_BITS == 64
21
+ *
42
#if defined(_WIN64)
22
+ * Even then, a scratch is only needed for l->raddr. Rather than expose
43
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
23
+ * a general-purpose scratch when we don't actually know it's available,
44
#define OPC_PCMPGTW (0x65 | P_EXT | P_DATA16)
24
+ * use the ra_gen hook to load into RAX if needed.
45
#define OPC_PCMPGTD (0x66 | P_EXT | P_DATA16)
25
+ */
46
#define OPC_PCMPGTQ (0x37 | P_EXT38 | P_DATA16)
26
+#if TCG_TARGET_REG_BITS == 64
47
+#define OPC_PEXTRD (0x16 | P_EXT3A | P_DATA16)
27
+static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
48
+#define OPC_PINSRD (0x22 | P_EXT3A | P_DATA16)
49
#define OPC_PMAXSB (0x3c | P_EXT38 | P_DATA16)
50
#define OPC_PMAXSW (0xee | P_EXT | P_DATA16)
51
#define OPC_PMAXSD (0x3d | P_EXT38 | P_DATA16)
52
@@ -XXX,XX +XXX,XX @@ typedef struct {
53
54
bool tcg_target_has_memory_bswap(MemOp memop)
55
{
56
- return have_movbe;
57
+ TCGAtomAlign aa;
58
+
59
+ if (!have_movbe) {
60
+ return false;
61
+ }
62
+ if ((memop & MO_SIZE) < MO_128) {
63
+ return true;
64
+ }
65
+
66
+ /*
67
+ * Reject 16-byte memop with 16-byte atomicity, i.e. VMOVDQA,
68
+ * but do allow a pair of 64-bit operations, i.e. MOVBEQ.
69
+ */
70
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
71
+ return aa.atom < MO_128;
72
}
73
74
/*
75
@@ -XXX,XX +XXX,XX @@ static const TCGLdstHelperParam ldst_helper_param = {
76
static const TCGLdstHelperParam ldst_helper_param = { };
77
#endif
78
79
+static void tcg_out_vec_to_pair(TCGContext *s, TCGType type,
80
+ TCGReg l, TCGReg h, TCGReg v)
28
+{
81
+{
29
+ if (arg < 0) {
82
+ int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
30
+ arg = TCG_REG_RAX;
83
+
31
+ }
84
+ /* vpmov{d,q} %v, %l */
32
+ tcg_out_movi(s, TCG_TYPE_PTR, arg, (uintptr_t)l->raddr);
85
+ tcg_out_vex_modrm(s, OPC_MOVD_EyVy + rexw, v, 0, l);
33
+ return arg;
86
+ /* vpextr{d,q} $1, %v, %h */
87
+ tcg_out_vex_modrm(s, OPC_PEXTRD + rexw, v, 0, h);
88
+ tcg_out8(s, 1);
34
+}
89
+}
35
+static const TCGLdstHelperParam ldst_helper_param = {
90
+
36
+ .ra_gen = ldst_ra_gen
91
+static void tcg_out_pair_to_vec(TCGContext *s, TCGType type,
37
+};
92
+ TCGReg v, TCGReg l, TCGReg h)
38
+#else
93
+{
39
+static const TCGLdstHelperParam ldst_helper_param = { };
94
+ int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
40
+#endif
95
+
96
+ /* vmov{d,q} %l, %v */
97
+ tcg_out_vex_modrm(s, OPC_MOVD_VyEy + rexw, v, 0, l);
98
+ /* vpinsr{d,q} $1, %h, %v, %v */
99
+ tcg_out_vex_modrm(s, OPC_PINSRD + rexw, v, v, h);
100
+ tcg_out8(s, 1);
101
+}
41
+
102
+
42
/*
103
/*
43
* Generate code for the slow path for a load at the end of block
104
* Generate code for the slow path for a load at the end of block
44
*/
105
*/
45
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
106
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
46
{
107
{
47
- MemOpIdx oi = l->oi;
108
TCGLabelQemuLdst *ldst = NULL;
48
- MemOp opc = get_memop(oi);
109
MemOp opc = get_memop(oi);
49
+ MemOp opc = get_memop(l->oi);
110
+ MemOp s_bits = opc & MO_SIZE;
50
tcg_insn_unit **label_ptr = &l->label_ptr[0];
111
unsigned a_mask;
51
112
52
/* resolve label address */
113
#ifdef CONFIG_SOFTMMU
53
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
114
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
54
tcg_patch32(label_ptr[1], s->code_ptr - label_ptr[1] - 4);
115
*h = x86_guest_base;
116
#endif
117
h->base = addrlo;
118
- h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
119
+ h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
120
a_mask = (1 << h->aa.align) - 1;
121
122
#ifdef CONFIG_SOFTMMU
123
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
124
TCGType tlbtype = TCG_TYPE_I32;
125
int trexw = 0, hrexw = 0, tlbrexw = 0;
126
unsigned mem_index = get_mmuidx(oi);
127
- unsigned s_bits = opc & MO_SIZE;
128
unsigned s_mask = (1 << s_bits) - 1;
129
int tlb_mask;
130
131
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
132
h.base, h.index, 0, h.ofs + 4);
133
}
134
break;
135
+
136
+ case MO_128:
137
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
138
+
139
+ /*
140
+ * Without 16-byte atomicity, use integer regs.
141
+ * That is where we want the data, and it allows bswaps.
142
+ */
143
+ if (h.aa.atom < MO_128) {
144
+ if (use_movbe) {
145
+ TCGReg t = datalo;
146
+ datalo = datahi;
147
+ datahi = t;
148
+ }
149
+ if (h.base == datalo || h.index == datalo) {
150
+ tcg_out_modrm_sib_offset(s, OPC_LEA + P_REXW, datahi,
151
+ h.base, h.index, 0, h.ofs);
152
+ tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
153
+ datalo, datahi, 0);
154
+ tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
155
+ datahi, datahi, 8);
156
+ } else {
157
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
158
+ h.base, h.index, 0, h.ofs);
159
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
160
+ h.base, h.index, 0, h.ofs + 8);
161
+ }
162
+ break;
163
+ }
164
+
165
+ /*
166
+ * With 16-byte atomicity, a vector load is required.
167
+ * If we already have 16-byte alignment, then VMOVDQA always works.
168
+ * Else if VMOVDQU has atomicity with dynamic alignment, use that.
169
+ * Else use we require a runtime test for alignment for VMOVDQA;
170
+ * use VMOVDQU on the unaligned nonatomic path for simplicity.
171
+ */
172
+ if (h.aa.align >= MO_128) {
173
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
174
+ TCG_TMP_VEC, 0,
175
+ h.base, h.index, 0, h.ofs);
176
+ } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
177
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
178
+ TCG_TMP_VEC, 0,
179
+ h.base, h.index, 0, h.ofs);
180
+ } else {
181
+ TCGLabel *l1 = gen_new_label();
182
+ TCGLabel *l2 = gen_new_label();
183
+
184
+ tcg_out_testi(s, h.base, 15);
185
+ tcg_out_jxx(s, JCC_JNE, l1, true);
186
+
187
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
188
+ TCG_TMP_VEC, 0,
189
+ h.base, h.index, 0, h.ofs);
190
+ tcg_out_jxx(s, JCC_JMP, l2, true);
191
+
192
+ tcg_out_label(s, l1);
193
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_VxWx + h.seg,
194
+ TCG_TMP_VEC, 0,
195
+ h.base, h.index, 0, h.ofs);
196
+ tcg_out_label(s, l2);
197
+ }
198
+ tcg_out_vec_to_pair(s, TCG_TYPE_I64, datalo, datahi, TCG_TMP_VEC);
199
+ break;
200
+
201
default:
202
g_assert_not_reached();
55
}
203
}
56
204
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
57
- if (TCG_TARGET_REG_BITS == 32) {
205
h.base, h.index, 0, h.ofs + 4);
58
- int ofs = 0;
206
}
59
-
207
break;
60
- tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
208
+
61
- ofs += 4;
209
+ case MO_128:
62
-
210
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
63
- tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
211
+
64
- ofs += 4;
212
+ /*
65
-
213
+ * Without 16-byte atomicity, use integer regs.
66
- if (TARGET_LONG_BITS == 64) {
214
+ * That is where we have the data, and it allows bswaps.
67
- tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
215
+ */
68
- ofs += 4;
216
+ if (h.aa.atom < MO_128) {
69
- }
217
+ if (use_movbe) {
70
-
218
+ TCGReg t = datalo;
71
- tcg_out_sti(s, TCG_TYPE_I32, oi, TCG_REG_ESP, ofs);
219
+ datalo = datahi;
72
- ofs += 4;
220
+ datahi = t;
73
-
221
+ }
74
- tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)l->raddr, TCG_REG_ESP, ofs);
222
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
75
- } else {
223
+ h.base, h.index, 0, h.ofs);
76
- tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
224
+ tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
77
- tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
225
+ h.base, h.index, 0, h.ofs + 8);
78
- l->addrlo_reg);
226
+ break;
79
- tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], oi);
227
+ }
80
- tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
228
+
81
- (uintptr_t)l->raddr);
229
+ /*
82
- }
230
+ * With 16-byte atomicity, a vector store is required.
83
-
231
+ * If we already have 16-byte alignment, then VMOVDQA always works.
84
+ tcg_out_ld_helper_args(s, l, &ldst_helper_param);
232
+ * Else if VMOVDQU has atomicity with dynamic alignment, use that.
85
tcg_out_branch(s, 1, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
233
+ * Else use we require a runtime test for alignment for VMOVDQA;
86
+ tcg_out_ld_helper_ret(s, l, false, &ldst_helper_param);
234
+ * use VMOVDQU on the unaligned nonatomic path for simplicity.
87
235
+ */
88
- if (TCG_TARGET_REG_BITS == 32 && (opc & MO_SIZE) == MO_64) {
236
+ tcg_out_pair_to_vec(s, TCG_TYPE_I64, TCG_TMP_VEC, datalo, datahi);
89
- TCGMovExtend ext[2] = {
237
+ if (h.aa.align >= MO_128) {
90
- { .dst = l->datalo_reg, .dst_type = TCG_TYPE_I32,
238
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
91
- .src = TCG_REG_EAX, .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
239
+ TCG_TMP_VEC, 0,
92
- { .dst = l->datahi_reg, .dst_type = TCG_TYPE_I32,
240
+ h.base, h.index, 0, h.ofs);
93
- .src = TCG_REG_EDX, .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
241
+ } else if (cpuinfo & CPUINFO_ATOMIC_VMOVDQU) {
94
- };
242
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
95
- tcg_out_movext2(s, &ext[0], &ext[1], -1);
243
+ TCG_TMP_VEC, 0,
96
- } else {
244
+ h.base, h.index, 0, h.ofs);
97
- tcg_out_movext(s, l->type, l->datalo_reg,
245
+ } else {
98
- TCG_TYPE_REG, opc & MO_SSIZE, TCG_REG_EAX);
246
+ TCGLabel *l1 = gen_new_label();
99
- }
247
+ TCGLabel *l2 = gen_new_label();
100
-
248
+
101
- /* Jump to the code corresponding to next IR of qemu_st */
249
+ tcg_out_testi(s, h.base, 15);
102
tcg_out_jmp(s, l->raddr);
250
+ tcg_out_jxx(s, JCC_JNE, l1, true);
103
return true;
251
+
104
}
252
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
253
+ TCG_TMP_VEC, 0,
254
+ h.base, h.index, 0, h.ofs);
255
+ tcg_out_jxx(s, JCC_JMP, l2, true);
256
+
257
+ tcg_out_label(s, l1);
258
+ tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQU_WxVx + h.seg,
259
+ TCG_TMP_VEC, 0,
260
+ h.base, h.index, 0, h.ofs);
261
+ tcg_out_label(s, l2);
262
+ }
263
+ break;
264
+
265
default:
266
g_assert_not_reached();
267
}
268
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
269
tcg_out_qemu_ld(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
270
}
271
break;
272
+ case INDEX_op_qemu_ld_a32_i128:
273
+ case INDEX_op_qemu_ld_a64_i128:
274
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
275
+ tcg_out_qemu_ld(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
276
+ break;
277
278
case INDEX_op_qemu_st_a64_i32:
279
case INDEX_op_qemu_st8_a64_i32:
280
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
281
tcg_out_qemu_st(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
282
}
283
break;
284
+ case INDEX_op_qemu_st_a32_i128:
285
+ case INDEX_op_qemu_st_a64_i128:
286
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
287
+ tcg_out_qemu_st(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
288
+ break;
289
290
OP_32_64(mulu2):
291
tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
292
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
293
case INDEX_op_qemu_st_a64_i64:
294
return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(L, L) : C_O0_I4(L, L, L, L);
295
296
+ case INDEX_op_qemu_ld_a32_i128:
297
+ case INDEX_op_qemu_ld_a64_i128:
298
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
299
+ return C_O2_I1(r, r, L);
300
+ case INDEX_op_qemu_st_a32_i128:
301
+ case INDEX_op_qemu_st_a64_i128:
302
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
303
+ return C_O0_I3(L, L, L);
304
+
305
case INDEX_op_brcond2_i32:
306
return C_O0_I4(r, r, ri, ri);
307
308
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
309
310
s->reserved_regs = 0;
311
tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
312
+ tcg_regset_set_reg(s->reserved_regs, TCG_TMP_VEC);
313
#ifdef _WIN64
314
/* These are call saved, and we don't save them, so don't use them. */
315
tcg_regset_set_reg(s->reserved_regs, TCG_REG_XMM6);
105
--
316
--
106
2.34.1
317
2.34.1
107
108
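To restate the store-selection logic in the i386 patch above outside of the generated code: the backend prefers VMOVDQA when 16-byte alignment is already known, VMOVDQU when the CPU makes it atomic regardless of alignment, and otherwise emits a runtime alignment test. A rough host-side C sketch of the same decision, assuming an x86-64 host with SSE2; the function name is made up and nothing at the C level guarantees 16-byte atomicity, it only shows the dispatch:

    #include <stdint.h>
    #include <emmintrin.h>

    /* Choose the aligned vector store when the address is 16-byte aligned,
       else fall back to the unaligned form -- the same test the backend
       emits around the MOVDQA/MOVDQU pair. */
    static void store16(void *dst, __m128i val)
    {
        if (((uintptr_t)dst & 15) == 0) {
            _mm_store_si128((__m128i *)dst, val);    /* MOVDQA path */
        } else {
            _mm_storeu_si128((__m128i *)dst, val);   /* MOVDQU path */
        }
    }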
1
Instead of trying to unify all operations on uint64_t, use
1
We will need to allocate a second general-purpose temporary.
2
mmu_lookup() to perform the basic tlb hit and resolution.
2
Rename the existing temps to add a distinguishing number.
3
Create individual functions to handle access by size.
4
3
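The size-specific store helpers introduced below share one slow path for accesses that cross a page boundary: the value is swapped to little-endian once, written byte by byte, and the unwritten remainder is handed on to the second page. A minimal standalone sketch of that loop (illustrative name, not the actual QEMU helper):

    #include <stddef.h>
    #include <stdint.h>

    /* Store the low 'size' bytes of val_le at haddr in little-endian order
       and return the bytes that remain for the following page. */
    static uint64_t store_bytes_le(uint8_t *haddr, size_t size, uint64_t val_le)
    {
        for (size_t i = 0; i < size; i++, val_le >>= 8) {
            haddr[i] = val_le;
        }
        return val_le;
    }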
5
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
4
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
6
---
8
accel/tcg/cputlb.c | 408 +++++++++++++++++++++------------------------
7
tcg/aarch64/tcg-target.c.inc | 50 ++++++++++++++++++------------------
9
1 file changed, 193 insertions(+), 215 deletions(-)
8
1 file changed, 25 insertions(+), 25 deletions(-)
10
9
11
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
10
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
11
index XXXXXXX..XXXXXXX 100644
13
--- a/accel/tcg/cputlb.c
12
--- a/tcg/aarch64/tcg-target.c.inc
14
+++ b/accel/tcg/cputlb.c
13
+++ b/tcg/aarch64/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ store_memop(void *haddr, uint64_t val, MemOp op)
14
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
15
return TCG_REG_X0 + slot;
16
}
17
18
-#define TCG_REG_TMP TCG_REG_X30
19
-#define TCG_VEC_TMP TCG_REG_V31
20
+#define TCG_REG_TMP0 TCG_REG_X30
21
+#define TCG_VEC_TMP0 TCG_REG_V31
22
23
#ifndef CONFIG_SOFTMMU
24
#define TCG_REG_GUEST_BASE TCG_REG_X28
25
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
26
static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
27
TCGReg r, TCGReg base, intptr_t offset)
28
{
29
- TCGReg temp = TCG_REG_TMP;
30
+ TCGReg temp = TCG_REG_TMP0;
31
32
if (offset < -0xffffff || offset > 0xffffff) {
33
tcg_out_movi(s, TCG_TYPE_PTR, temp, offset);
34
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ldst(TCGContext *s, AArch64Insn insn, TCGReg rd,
35
}
36
37
/* Worst-case scenario, move offset to temp register, use reg offset. */
38
- tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, offset);
39
- tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP);
40
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, offset);
41
+ tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP0);
42
}
43
44
static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
45
@@ -XXX,XX +XXX,XX @@ static void tcg_out_call_int(TCGContext *s, const tcg_insn_unit *target)
46
if (offset == sextract64(offset, 0, 26)) {
47
tcg_out_insn(s, 3206, BL, offset);
48
} else {
49
- tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
50
- tcg_out_insn(s, 3207, BLR, TCG_REG_TMP);
51
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, (intptr_t)target);
52
+ tcg_out_insn(s, 3207, BLR, TCG_REG_TMP0);
16
}
53
}
17
}
54
}
18
55
19
-static void full_stb_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
56
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
20
- MemOpIdx oi, uintptr_t retaddr);
57
AArch64Insn insn;
21
-
58
22
-static void __attribute__((noinline))
59
if (rl == ah || (!const_bh && rl == bh)) {
23
-store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
60
- rl = TCG_REG_TMP;
24
- uintptr_t retaddr, size_t size, uintptr_t mmu_idx,
61
+ rl = TCG_REG_TMP0;
25
- bool big_endian)
62
}
26
+/**
63
27
+ * do_st_mmio_leN:
64
if (const_bl) {
28
+ * @env: cpu context
65
@@ -XXX,XX +XXX,XX @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
29
+ * @p: translation parameters
66
possibility of adding 0+const in the low part, and the
30
+ * @val_le: data to store
67
immediate add instructions encode XSP not XZR. Don't try
31
+ * @mmu_idx: virtual address context
68
anything more elaborate here than loading another zero. */
32
+ * @ra: return address into tcg generated code, or 0
69
- al = TCG_REG_TMP;
33
+ *
70
+ al = TCG_REG_TMP0;
34
+ * Store @p->size bytes at @p->addr, which is memory-mapped i/o.
71
tcg_out_movi(s, ext, al, 0);
35
+ * The bytes to store are extracted in little-endian order from @val_le;
72
}
36
+ * return the bytes of @val_le beyond @p->size that have not been stored.
73
tcg_out_insn_3401(s, insn, ext, rl, al, bl);
37
+ */
74
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
38
+static uint64_t do_st_mmio_leN(CPUArchState *env, MMULookupPageData *p,
39
+ uint64_t val_le, int mmu_idx, uintptr_t ra)
40
{
75
{
41
- uintptr_t index, index2;
76
TCGReg a1 = a0;
42
- CPUTLBEntry *entry, *entry2;
77
if (is_ctz) {
43
- target_ulong page1, page2, tlb_addr, tlb_addr2;
78
- a1 = TCG_REG_TMP;
44
- MemOpIdx oi;
79
+ a1 = TCG_REG_TMP0;
45
- size_t size2;
80
tcg_out_insn(s, 3507, RBIT, ext, a1, a0);
46
- int i;
47
+ CPUTLBEntryFull *full = p->full;
48
+ target_ulong addr = p->addr;
49
+ int i, size = p->size;
50
51
- /*
52
- * Ensure the second page is in the TLB. Note that the first page
53
- * is already guaranteed to be filled, and that the second page
54
- * cannot evict the first. An exception to this rule is PAGE_WRITE_INV
55
- * handling: the first page could have evicted itself.
56
- */
57
- page1 = addr & TARGET_PAGE_MASK;
58
- page2 = (addr + size) & TARGET_PAGE_MASK;
59
- size2 = (addr + size) & ~TARGET_PAGE_MASK;
60
- index2 = tlb_index(env, mmu_idx, page2);
61
- entry2 = tlb_entry(env, mmu_idx, page2);
62
-
63
- tlb_addr2 = tlb_addr_write(entry2);
64
- if (page1 != page2 && !tlb_hit_page(tlb_addr2, page2)) {
65
- if (!victim_tlb_hit(env, mmu_idx, index2, MMU_DATA_STORE, page2)) {
66
- tlb_fill(env_cpu(env), page2, size2, MMU_DATA_STORE,
67
- mmu_idx, retaddr);
68
- index2 = tlb_index(env, mmu_idx, page2);
69
- entry2 = tlb_entry(env, mmu_idx, page2);
70
- }
71
- tlb_addr2 = tlb_addr_write(entry2);
72
+ QEMU_IOTHREAD_LOCK_GUARD();
73
+ for (i = 0; i < size; i++, val_le >>= 8) {
74
+ io_writex(env, full, mmu_idx, val_le, addr + i, ra, MO_UB);
75
}
81
}
76
+ return val_le;
82
if (const_b && b == (ext ? 64 : 32)) {
77
+}
83
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
78
84
AArch64Insn sel = I3506_CSEL;
79
- index = tlb_index(env, mmu_idx, addr);
85
80
- entry = tlb_entry(env, mmu_idx, addr);
86
tcg_out_cmp(s, ext, a0, 0, 1);
81
- tlb_addr = tlb_addr_write(entry);
87
- tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP, a1);
82
+/**
88
+ tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP0, a1);
83
+ * do_st_bytes_leN:
89
84
+ * @p: translation parameters
90
if (const_b) {
85
+ * @val_le: data to store
91
if (b == -1) {
86
+ *
92
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
87
+ * Store @p->size bytes at @p->haddr, which is RAM.
93
b = d;
88
+ * The bytes to store are extracted in little-endian order from @val_le;
94
}
89
+ * return the bytes of @val_le beyond @p->size that have not been stored.
95
}
90
+ */
96
- tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP, b, TCG_COND_NE);
91
+static uint64_t do_st_bytes_leN(MMULookupPageData *p, uint64_t val_le)
97
+ tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP0, b, TCG_COND_NE);
92
+{
93
+ uint8_t *haddr = p->haddr;
94
+ int i, size = p->size;
95
96
- /*
97
- * Handle watchpoints. Since this may trap, all checks
98
- * must happen before any store.
99
- */
100
- if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
101
- cpu_check_watchpoint(env_cpu(env), addr, size - size2,
102
- env_tlb(env)->d[mmu_idx].fulltlb[index].attrs,
103
- BP_MEM_WRITE, retaddr);
104
- }
105
- if (unlikely(tlb_addr2 & TLB_WATCHPOINT)) {
106
- cpu_check_watchpoint(env_cpu(env), page2, size2,
107
- env_tlb(env)->d[mmu_idx].fulltlb[index2].attrs,
108
- BP_MEM_WRITE, retaddr);
109
+ for (i = 0; i < size; i++, val_le >>= 8) {
110
+ haddr[i] = val_le;
111
}
112
+ return val_le;
113
+}
114
115
- /*
116
- * XXX: not efficient, but simple.
117
- * This loop must go in the forward direction to avoid issues
118
- * with self-modifying code in Windows 64-bit.
119
- */
120
- oi = make_memop_idx(MO_UB, mmu_idx);
121
- if (big_endian) {
122
- for (i = 0; i < size; ++i) {
123
- /* Big-endian extract. */
124
- uint8_t val8 = val >> (((size - 1) * 8) - (i * 8));
125
- full_stb_mmu(env, addr + i, val8, oi, retaddr);
126
- }
127
+/*
128
+ * Wrapper for the above.
129
+ */
130
+static uint64_t do_st_leN(CPUArchState *env, MMULookupPageData *p,
131
+ uint64_t val_le, int mmu_idx, uintptr_t ra)
132
+{
133
+ if (unlikely(p->flags & TLB_MMIO)) {
134
+ return do_st_mmio_leN(env, p, val_le, mmu_idx, ra);
135
+ } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
136
+ return val_le >> (p->size * 8);
137
} else {
138
- for (i = 0; i < size; ++i) {
139
- /* Little-endian extract. */
140
- uint8_t val8 = val >> (i * 8);
141
- full_stb_mmu(env, addr + i, val8, oi, retaddr);
142
- }
143
+ return do_st_bytes_leN(p, val_le);
144
}
98
}
145
}
99
}
146
100
147
-static inline void QEMU_ALWAYS_INLINE
101
@@ -XXX,XX +XXX,XX @@ bool tcg_target_has_memory_bswap(MemOp memop)
148
-store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
102
}
149
- MemOpIdx oi, uintptr_t retaddr, MemOp op)
103
150
+static void do_st_1(CPUArchState *env, MMULookupPageData *p, uint8_t val,
104
static const TCGLdstHelperParam ldst_helper_param = {
151
+ int mmu_idx, uintptr_t ra)
105
- .ntmp = 1, .tmp = { TCG_REG_TMP }
152
{
106
+ .ntmp = 1, .tmp = { TCG_REG_TMP0 }
153
- const unsigned a_bits = get_alignment_bits(get_memop(oi));
107
};
154
- const size_t size = memop_size(op);
108
155
- uintptr_t mmu_idx = get_mmuidx(oi);
109
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
156
- uintptr_t index;
110
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto_tb(TCGContext *s, int which)
157
- CPUTLBEntry *entry;
111
158
- target_ulong tlb_addr;
112
set_jmp_insn_offset(s, which);
159
- void *haddr;
113
tcg_out32(s, I3206_B);
160
-
114
- tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
161
- tcg_debug_assert(mmu_idx < NB_MMU_MODES);
115
+ tcg_out_insn(s, 3207, BR, TCG_REG_TMP0);
162
-
116
set_jmp_reset_offset(s, which);
163
- /* Handle CPU specific unaligned behaviour */
117
}
164
- if (addr & ((1 << a_bits) - 1)) {
118
165
- cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_STORE,
119
@@ -XXX,XX +XXX,XX @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
166
- mmu_idx, retaddr);
120
ptrdiff_t i_offset = i_addr - jmp_rx;
167
+ if (unlikely(p->flags & TLB_MMIO)) {
121
168
+ io_writex(env, p->full, mmu_idx, val, p->addr, ra, MO_UB);
122
/* Note that we asserted this in range in tcg_out_goto_tb. */
169
+ } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
123
- insn = deposit32(I3305_LDR | TCG_REG_TMP, 5, 19, i_offset >> 2);
170
+ /* nothing */
124
+ insn = deposit32(I3305_LDR | TCG_REG_TMP0, 5, 19, i_offset >> 2);
171
+ } else {
172
+ *(uint8_t *)p->haddr = val;
173
}
125
}
174
-
126
qatomic_set((uint32_t *)jmp_rw, insn);
175
- index = tlb_index(env, mmu_idx, addr);
127
flush_idcache_range(jmp_rx, jmp_rw, 4);
176
- entry = tlb_entry(env, mmu_idx, addr);
128
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
177
- tlb_addr = tlb_addr_write(entry);
129
178
-
130
case INDEX_op_rem_i64:
179
- /* If the TLB entry is for a different page, reload and try again. */
131
case INDEX_op_rem_i32:
180
- if (!tlb_hit(tlb_addr, addr)) {
132
- tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP, a1, a2);
181
- if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
133
- tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
182
- addr & TARGET_PAGE_MASK)) {
134
+ tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP0, a1, a2);
183
- tlb_fill(env_cpu(env), addr, size, MMU_DATA_STORE,
135
+ tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
184
- mmu_idx, retaddr);
136
break;
185
- index = tlb_index(env, mmu_idx, addr);
137
case INDEX_op_remu_i64:
186
- entry = tlb_entry(env, mmu_idx, addr);
138
case INDEX_op_remu_i32:
187
- }
139
- tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP, a1, a2);
188
- tlb_addr = tlb_addr_write(entry) & ~TLB_INVALID_MASK;
140
- tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
189
- }
141
+ tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP0, a1, a2);
190
-
142
+ tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
191
- /* Handle anything that isn't just a straight memory access. */
143
break;
192
- if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
144
193
- CPUTLBEntryFull *full;
145
case INDEX_op_shl_i64:
194
- bool need_swap;
146
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
195
-
147
if (c2) {
196
- /* For anything that is unaligned, recurse through byte stores. */
148
tcg_out_rotl(s, ext, a0, a1, a2);
197
- if ((addr & (size - 1)) != 0) {
149
} else {
198
- goto do_unaligned_access;
150
- tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP, TCG_REG_XZR, a2);
199
- }
151
- tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP);
200
-
152
+ tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP0, TCG_REG_XZR, a2);
201
- full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
153
+ tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP0);
202
-
154
}
203
- /* Handle watchpoints. */
155
break;
204
- if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
156
205
- /* On watchpoint hit, this will longjmp out. */
157
@@ -XXX,XX +XXX,XX @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
206
- cpu_check_watchpoint(env_cpu(env), addr, size,
158
break;
207
- full->attrs, BP_MEM_WRITE, retaddr);
159
}
208
- }
160
}
209
-
161
- tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP, 0);
210
- need_swap = size > 1 && (tlb_addr & TLB_BSWAP);
162
- a2 = TCG_VEC_TMP;
211
-
163
+ tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP0, 0);
212
- /* Handle I/O access. */
164
+ a2 = TCG_VEC_TMP0;
213
- if (tlb_addr & TLB_MMIO) {
165
}
214
- io_writex(env, full, mmu_idx, val, addr, retaddr,
166
if (is_scalar) {
215
- op ^ (need_swap * MO_BSWAP));
167
insn = cmp_scalar_insn[cond];
216
- return;
168
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
217
- }
169
s->reserved_regs = 0;
218
-
170
tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
219
- /* Ignore writes to ROM. */
171
tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
220
- if (unlikely(tlb_addr & TLB_DISCARD_WRITE)) {
172
- tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
221
- return;
173
tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
222
- }
174
- tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP);
223
-
175
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
224
- /* Handle clean RAM pages. */
176
+ tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
225
- if (tlb_addr & TLB_NOTDIRTY) {
226
- notdirty_write(env_cpu(env), addr, size, full, retaddr);
227
- }
228
-
229
- haddr = (void *)((uintptr_t)addr + entry->addend);
230
-
231
- /*
232
- * Keep these two store_memop separate to ensure that the compiler
233
- * is able to fold the entire function to a single instruction.
234
- * There is a build-time assert inside to remind you of this. ;-)
235
- */
236
- if (unlikely(need_swap)) {
237
- store_memop(haddr, val, op ^ MO_BSWAP);
238
- } else {
239
- store_memop(haddr, val, op);
240
- }
241
- return;
242
- }
243
-
244
- /* Handle slow unaligned access (it spans two pages or IO). */
245
- if (size > 1
246
- && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
247
- >= TARGET_PAGE_SIZE)) {
248
- do_unaligned_access:
249
- store_helper_unaligned(env, addr, val, retaddr, size,
250
- mmu_idx, memop_big_endian(op));
251
- return;
252
- }
253
-
254
- haddr = (void *)((uintptr_t)addr + entry->addend);
255
- store_memop(haddr, val, op);
256
}
177
}
257
178
258
-static void __attribute__((noinline))
179
/* Saving pairs: (X19, X20) .. (X27, X28), (X29(fp), X30(lr)). */
259
-full_stb_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
260
- MemOpIdx oi, uintptr_t retaddr)
261
+static void do_st_2(CPUArchState *env, MMULookupPageData *p, uint16_t val,
262
+ int mmu_idx, MemOp memop, uintptr_t ra)
263
{
264
- validate_memop(oi, MO_UB);
265
- store_helper(env, addr, val, oi, retaddr, MO_UB);
266
+ if (unlikely(p->flags & TLB_MMIO)) {
267
+ io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
268
+ } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
269
+ /* nothing */
270
+ } else {
271
+ /* Swap to host endian if necessary, then store. */
272
+ if (memop & MO_BSWAP) {
273
+ val = bswap16(val);
274
+ }
275
+ store_memop(p->haddr, val, MO_UW);
276
+ }
277
+}
278
+
279
+static void do_st_4(CPUArchState *env, MMULookupPageData *p, uint32_t val,
280
+ int mmu_idx, MemOp memop, uintptr_t ra)
281
+{
282
+ if (unlikely(p->flags & TLB_MMIO)) {
283
+ io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
284
+ } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
285
+ /* nothing */
286
+ } else {
287
+ /* Swap to host endian if necessary, then store. */
288
+ if (memop & MO_BSWAP) {
289
+ val = bswap32(val);
290
+ }
291
+ store_memop(p->haddr, val, MO_UL);
292
+ }
293
+}
294
+
295
+static void do_st_8(CPUArchState *env, MMULookupPageData *p, uint64_t val,
296
+ int mmu_idx, MemOp memop, uintptr_t ra)
297
+{
298
+ if (unlikely(p->flags & TLB_MMIO)) {
299
+ io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
300
+ } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
301
+ /* nothing */
302
+ } else {
303
+ /* Swap to host endian if necessary, then store. */
304
+ if (memop & MO_BSWAP) {
305
+ val = bswap64(val);
306
+ }
307
+ store_memop(p->haddr, val, MO_UQ);
308
+ }
309
}
310
311
void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
312
- MemOpIdx oi, uintptr_t retaddr)
313
+ MemOpIdx oi, uintptr_t ra)
314
{
315
- full_stb_mmu(env, addr, val, oi, retaddr);
316
+ MMULookupLocals l;
317
+ bool crosspage;
318
+
319
+ validate_memop(oi, MO_UB);
320
+ crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
321
+ tcg_debug_assert(!crosspage);
322
+
323
+ do_st_1(env, &l.page[0], val, l.mmu_idx, ra);
324
}
325
326
-static void full_le_stw_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
327
- MemOpIdx oi, uintptr_t retaddr)
328
+static void do_st2_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
329
+ MemOpIdx oi, uintptr_t ra)
330
{
331
- validate_memop(oi, MO_LEUW);
332
- store_helper(env, addr, val, oi, retaddr, MO_LEUW);
333
+ MMULookupLocals l;
334
+ bool crosspage;
335
+ uint8_t a, b;
336
+
337
+ crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
338
+ if (likely(!crosspage)) {
339
+ do_st_2(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
340
+ return;
341
+ }
342
+
343
+ if ((l.memop & MO_BSWAP) == MO_LE) {
344
+ a = val, b = val >> 8;
345
+ } else {
346
+ b = val, a = val >> 8;
347
+ }
348
+ do_st_1(env, &l.page[0], a, l.mmu_idx, ra);
349
+ do_st_1(env, &l.page[1], b, l.mmu_idx, ra);
350
}
351
352
void helper_le_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
353
MemOpIdx oi, uintptr_t retaddr)
354
{
355
- full_le_stw_mmu(env, addr, val, oi, retaddr);
356
-}
357
-
358
-static void full_be_stw_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
359
- MemOpIdx oi, uintptr_t retaddr)
360
-{
361
- validate_memop(oi, MO_BEUW);
362
- store_helper(env, addr, val, oi, retaddr, MO_BEUW);
363
+ validate_memop(oi, MO_LEUW);
364
+ do_st2_mmu(env, addr, val, oi, retaddr);
365
}
366
367
void helper_be_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
368
MemOpIdx oi, uintptr_t retaddr)
369
{
370
- full_be_stw_mmu(env, addr, val, oi, retaddr);
371
+ validate_memop(oi, MO_BEUW);
372
+ do_st2_mmu(env, addr, val, oi, retaddr);
373
}
374
375
-static void full_le_stl_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
376
- MemOpIdx oi, uintptr_t retaddr)
377
+static void do_st4_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
378
+ MemOpIdx oi, uintptr_t ra)
379
{
380
- validate_memop(oi, MO_LEUL);
381
- store_helper(env, addr, val, oi, retaddr, MO_LEUL);
382
+ MMULookupLocals l;
383
+ bool crosspage;
384
+
385
+ crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
386
+ if (likely(!crosspage)) {
387
+ do_st_4(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
388
+ return;
389
+ }
390
+
391
+ /* Swap to little endian for simplicity, then store by bytes. */
392
+ if ((l.memop & MO_BSWAP) != MO_LE) {
393
+ val = bswap32(val);
394
+ }
395
+ val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
396
+ (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
397
}
398
399
void helper_le_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
400
MemOpIdx oi, uintptr_t retaddr)
401
{
402
- full_le_stl_mmu(env, addr, val, oi, retaddr);
403
-}
404
-
405
-static void full_be_stl_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
406
- MemOpIdx oi, uintptr_t retaddr)
407
-{
408
- validate_memop(oi, MO_BEUL);
409
- store_helper(env, addr, val, oi, retaddr, MO_BEUL);
410
+ validate_memop(oi, MO_LEUL);
411
+ do_st4_mmu(env, addr, val, oi, retaddr);
412
}
413
414
void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
415
MemOpIdx oi, uintptr_t retaddr)
416
{
417
- full_be_stl_mmu(env, addr, val, oi, retaddr);
418
+ validate_memop(oi, MO_BEUL);
419
+ do_st4_mmu(env, addr, val, oi, retaddr);
420
+}
421
+
422
+static void do_st8_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
423
+ MemOpIdx oi, uintptr_t ra)
424
+{
425
+ MMULookupLocals l;
426
+ bool crosspage;
427
+
428
+ crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
429
+ if (likely(!crosspage)) {
430
+ do_st_8(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
431
+ return;
432
+ }
433
+
434
+ /* Swap to little endian for simplicity, then store by bytes. */
435
+ if ((l.memop & MO_BSWAP) != MO_LE) {
436
+ val = bswap64(val);
437
+ }
438
+ val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
439
+ (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
440
}
441
442
void helper_le_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
443
MemOpIdx oi, uintptr_t retaddr)
444
{
445
validate_memop(oi, MO_LEUQ);
446
- store_helper(env, addr, val, oi, retaddr, MO_LEUQ);
447
+ do_st8_mmu(env, addr, val, oi, retaddr);
448
}
449
450
void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
451
MemOpIdx oi, uintptr_t retaddr)
452
{
453
validate_memop(oi, MO_BEUQ);
454
- store_helper(env, addr, val, oi, retaddr, MO_BEUQ);
455
+ do_st8_mmu(env, addr, val, oi, retaddr);
456
}
457
458
/*
459
* Store Helpers for cpu_ldst.h
460
*/
461
462
-typedef void FullStoreHelper(CPUArchState *env, target_ulong addr,
463
- uint64_t val, MemOpIdx oi, uintptr_t retaddr);
464
-
465
-static inline void cpu_store_helper(CPUArchState *env, target_ulong addr,
466
- uint64_t val, MemOpIdx oi, uintptr_t ra,
467
- FullStoreHelper *full_store)
468
+static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
469
{
470
- full_store(env, addr, val, oi, ra);
471
qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
472
}
473
474
void cpu_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
475
MemOpIdx oi, uintptr_t retaddr)
476
{
477
- cpu_store_helper(env, addr, val, oi, retaddr, full_stb_mmu);
478
+ helper_ret_stb_mmu(env, addr, val, oi, retaddr);
479
+ plugin_store_cb(env, addr, oi);
480
}
481
482
void cpu_stw_be_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
483
MemOpIdx oi, uintptr_t retaddr)
484
{
485
- cpu_store_helper(env, addr, val, oi, retaddr, full_be_stw_mmu);
486
+ helper_be_stw_mmu(env, addr, val, oi, retaddr);
487
+ plugin_store_cb(env, addr, oi);
488
}
489
490
void cpu_stl_be_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
491
MemOpIdx oi, uintptr_t retaddr)
492
{
493
- cpu_store_helper(env, addr, val, oi, retaddr, full_be_stl_mmu);
494
+ helper_be_stl_mmu(env, addr, val, oi, retaddr);
495
+ plugin_store_cb(env, addr, oi);
496
}
497
498
void cpu_stq_be_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
499
MemOpIdx oi, uintptr_t retaddr)
500
{
501
- cpu_store_helper(env, addr, val, oi, retaddr, helper_be_stq_mmu);
502
+ helper_be_stq_mmu(env, addr, val, oi, retaddr);
503
+ plugin_store_cb(env, addr, oi);
504
}
505
506
void cpu_stw_le_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
507
MemOpIdx oi, uintptr_t retaddr)
508
{
509
- cpu_store_helper(env, addr, val, oi, retaddr, full_le_stw_mmu);
510
+ helper_le_stw_mmu(env, addr, val, oi, retaddr);
511
+ plugin_store_cb(env, addr, oi);
512
}
513
514
void cpu_stl_le_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
515
MemOpIdx oi, uintptr_t retaddr)
516
{
517
- cpu_store_helper(env, addr, val, oi, retaddr, full_le_stl_mmu);
518
+ helper_le_stl_mmu(env, addr, val, oi, retaddr);
519
+ plugin_store_cb(env, addr, oi);
520
}
521
522
void cpu_stq_le_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
523
MemOpIdx oi, uintptr_t retaddr)
524
{
525
- cpu_store_helper(env, addr, val, oi, retaddr, helper_le_stq_mmu);
526
+ helper_le_stq_mmu(env, addr, val, oi, retaddr);
527
+ plugin_store_cb(env, addr, oi);
528
}
529
530
void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
531
--
180
--
532
2.34.1
181
2.34.1
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
and tcg_out_st_helper_args.
3
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
3
---
7
tcg/aarch64/tcg-target.c.inc | 40 +++++++++++++++---------------------
4
tcg/aarch64/tcg-target.c.inc | 9 +++++++--
8
1 file changed, 16 insertions(+), 24 deletions(-)
5
1 file changed, 7 insertions(+), 2 deletions(-)
9
6
10
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
7
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
11
index XXXXXXX..XXXXXXX 100644
8
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/aarch64/tcg-target.c.inc
9
--- a/tcg/aarch64/tcg-target.c.inc
13
+++ b/tcg/aarch64/tcg-target.c.inc
10
+++ b/tcg/aarch64/tcg-target.c.inc
14
@@ -XXX,XX +XXX,XX @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
11
@@ -XXX,XX +XXX,XX @@ static const int tcg_target_reg_alloc_order[] = {
15
}
12
13
TCG_REG_X8, TCG_REG_X9, TCG_REG_X10, TCG_REG_X11,
14
TCG_REG_X12, TCG_REG_X13, TCG_REG_X14, TCG_REG_X15,
15
- TCG_REG_X16, TCG_REG_X17,
16
17
TCG_REG_X0, TCG_REG_X1, TCG_REG_X2, TCG_REG_X3,
18
TCG_REG_X4, TCG_REG_X5, TCG_REG_X6, TCG_REG_X7,
19
20
+ /* X16 reserved as temporary */
21
+ /* X17 reserved as temporary */
22
/* X18 reserved by system */
23
/* X19 reserved for AREG0 */
24
/* X29 reserved as fp */
25
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
26
return TCG_REG_X0 + slot;
16
}
27
}
17
28
18
-static void tcg_out_adr(TCGContext *s, TCGReg rd, const void *target)
29
-#define TCG_REG_TMP0 TCG_REG_X30
19
-{
30
+#define TCG_REG_TMP0 TCG_REG_X16
20
- ptrdiff_t offset = tcg_pcrel_diff(s, target);
31
+#define TCG_REG_TMP1 TCG_REG_X17
21
- tcg_debug_assert(offset == sextract64(offset, 0, 21));
32
+#define TCG_REG_TMP2 TCG_REG_X30
22
- tcg_out_insn(s, 3406, ADR, rd, offset);
33
#define TCG_VEC_TMP0 TCG_REG_V31
23
-}
34
24
-
35
#ifndef CONFIG_SOFTMMU
25
typedef struct {
36
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
26
TCGReg base;
37
tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
27
TCGReg index;
38
tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
28
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[MO_SIZE + 1] = {
39
tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
29
#endif
40
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
30
};
41
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP2);
31
42
tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
32
+static const TCGLdstHelperParam ldst_helper_param = {
33
+ .ntmp = 1, .tmp = { TCG_REG_TMP }
34
+};
35
+
36
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
37
{
38
- MemOpIdx oi = lb->oi;
39
- MemOp opc = get_memop(oi);
40
+ MemOp opc = get_memop(lb->oi);
41
42
if (!reloc_pc19(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
43
return false;
44
}
45
46
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_X0, TCG_AREG0);
47
- tcg_out_mov(s, TARGET_LONG_BITS == 64, TCG_REG_X1, lb->addrlo_reg);
48
- tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X2, oi);
49
- tcg_out_adr(s, TCG_REG_X3, lb->raddr);
50
+ tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
51
tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SIZE]);
52
-
53
- tcg_out_movext(s, lb->type, lb->datalo_reg,
54
- TCG_TYPE_REG, opc & MO_SSIZE, TCG_REG_X0);
55
+ tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
56
tcg_out_goto(s, lb->raddr);
57
return true;
58
}
43
}
59
44
60
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
61
{
62
- MemOpIdx oi = lb->oi;
63
- MemOp opc = get_memop(oi);
64
- MemOp size = opc & MO_SIZE;
65
+ MemOp opc = get_memop(lb->oi);
66
67
if (!reloc_pc19(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
68
return false;
69
}
70
71
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_X0, TCG_AREG0);
72
- tcg_out_mov(s, TARGET_LONG_BITS == 64, TCG_REG_X1, lb->addrlo_reg);
73
- tcg_out_mov(s, size == MO_64, TCG_REG_X2, lb->datalo_reg);
74
- tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X3, oi);
75
- tcg_out_adr(s, TCG_REG_X4, lb->raddr);
76
+ tcg_out_st_helper_args(s, lb, &ldst_helper_param);
77
tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE]);
78
tcg_out_goto(s, lb->raddr);
79
return true;
80
}
81
#else
82
+static void tcg_out_adr(TCGContext *s, TCGReg rd, const void *target)
83
+{
84
+ ptrdiff_t offset = tcg_pcrel_diff(s, target);
85
+ tcg_debug_assert(offset == sextract64(offset, 0, 21));
86
+ tcg_out_insn(s, 3406, ADR, rd, offset);
87
+}
88
+
89
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
90
{
91
if (!reloc_pc19(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
92
--
45
--
93
2.34.1
46
2.34.1
94
95
1
The softmmu tlb uses TCG_REG_TMP[0-2], not any of the normally available
1
Adjust the softmmu tlb to use TMP[0-2], not any of the normally available
2
registers. Now that we handle overlap between inputs and helper arguments,
2
registers. Since we handle overlap between inputs and helper arguments,
3
we can allow any allocatable reg.
3
we can allow any allocatable reg.
4
4
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
7
---
9
tcg/riscv/tcg-target-con-set.h | 2 --
8
tcg/aarch64/tcg-target-con-set.h | 2 --
10
tcg/riscv/tcg-target-con-str.h | 1 -
9
tcg/aarch64/tcg-target-con-str.h | 1 -
11
tcg/riscv/tcg-target.c.inc | 16 +++-------------
10
tcg/aarch64/tcg-target.c.inc | 45 ++++++++++++++------------------
12
3 files changed, 3 insertions(+), 16 deletions(-)
11
3 files changed, 19 insertions(+), 29 deletions(-)
13
12
14
diff --git a/tcg/riscv/tcg-target-con-set.h b/tcg/riscv/tcg-target-con-set.h
13
diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
15
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
16
--- a/tcg/riscv/tcg-target-con-set.h
15
--- a/tcg/aarch64/tcg-target-con-set.h
17
+++ b/tcg/riscv/tcg-target-con-set.h
16
+++ b/tcg/aarch64/tcg-target-con-set.h
18
@@ -XXX,XX +XXX,XX @@
17
@@ -XXX,XX +XXX,XX @@
19
* tcg-target-con-str.h; the constraint combination is inclusive or.
18
* tcg-target-con-str.h; the constraint combination is inclusive or.
20
*/
19
*/
21
C_O0_I1(r)
20
C_O0_I1(r)
22
-C_O0_I2(LZ, L)
21
-C_O0_I2(lZ, l)
22
C_O0_I2(r, rA)
23
C_O0_I2(rZ, r)
23
C_O0_I2(rZ, r)
24
C_O0_I2(rZ, rZ)
24
C_O0_I2(w, r)
25
-C_O1_I1(r, L)
25
-C_O1_I1(r, l)
26
C_O1_I1(r, r)
26
C_O1_I1(r, r)
27
C_O1_I2(r, r, ri)
27
C_O1_I1(w, r)
28
C_O1_I2(r, r, rI)
28
C_O1_I1(w, w)
29
diff --git a/tcg/riscv/tcg-target-con-str.h b/tcg/riscv/tcg-target-con-str.h
29
diff --git a/tcg/aarch64/tcg-target-con-str.h b/tcg/aarch64/tcg-target-con-str.h
30
index XXXXXXX..XXXXXXX 100644
30
index XXXXXXX..XXXXXXX 100644
31
--- a/tcg/riscv/tcg-target-con-str.h
31
--- a/tcg/aarch64/tcg-target-con-str.h
32
+++ b/tcg/riscv/tcg-target-con-str.h
32
+++ b/tcg/aarch64/tcg-target-con-str.h
33
@@ -XXX,XX +XXX,XX @@
33
@@ -XXX,XX +XXX,XX @@
34
* REGS(letter, register_mask)
34
* REGS(letter, register_mask)
35
*/
35
*/
36
REGS('r', ALL_GENERAL_REGS)
36
REGS('r', ALL_GENERAL_REGS)
37
-REGS('L', ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
37
-REGS('l', ALL_QLDST_REGS)
38
REGS('w', ALL_VECTOR_REGS)
38
39
39
/*
40
/*
40
* Define constraint letters for constants:
41
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
41
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
42
index XXXXXXX..XXXXXXX 100644
42
index XXXXXXX..XXXXXXX 100644
43
--- a/tcg/riscv/tcg-target.c.inc
43
--- a/tcg/aarch64/tcg-target.c.inc
44
+++ b/tcg/riscv/tcg-target.c.inc
44
+++ b/tcg/aarch64/tcg-target.c.inc
45
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
45
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
46
#define TCG_CT_CONST_N12 0x400
46
#define ALL_GENERAL_REGS 0xffffffffu
47
#define TCG_CT_CONST_M12 0x800
47
#define ALL_VECTOR_REGS 0xffffffff00000000ull
48
48
49
-#define ALL_GENERAL_REGS MAKE_64BIT_MASK(0, 32)
50
-/*
51
- * For softmmu, we need to avoid conflicts with the first 5
52
- * argument registers to call the helper. Some of these are
53
- * also used for the tlb lookup.
54
- */
55
-#ifdef CONFIG_SOFTMMU
49
-#ifdef CONFIG_SOFTMMU
56
-#define SOFTMMU_RESERVE_REGS MAKE_64BIT_MASK(TCG_REG_A0, 5)
50
-#define ALL_QLDST_REGS \
51
- (ALL_GENERAL_REGS & ~((1 << TCG_REG_X0) | (1 << TCG_REG_X1) | \
52
- (1 << TCG_REG_X2) | (1 << TCG_REG_X3)))
57
-#else
53
-#else
58
-#define SOFTMMU_RESERVE_REGS 0
54
-#define ALL_QLDST_REGS ALL_GENERAL_REGS
59
-#endif
55
-#endif
60
+#define ALL_GENERAL_REGS MAKE_64BIT_MASK(0, 32)
56
-
61
57
/* Match a constant valid for addition (12-bit, optionally shifted). */
62
#define sextreg sextract64
58
static inline bool is_aimm(uint64_t val)
63
59
{
60
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
61
unsigned s_bits = opc & MO_SIZE;
62
unsigned s_mask = (1u << s_bits) - 1;
63
unsigned mem_index = get_mmuidx(oi);
64
- TCGReg x3;
65
+ TCGReg addr_adj;
66
TCGType mask_type;
67
uint64_t compare_mask;
68
69
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
70
mask_type = (s->page_bits + s->tlb_dyn_max_bits > 32
71
? TCG_TYPE_I64 : TCG_TYPE_I32);
72
73
- /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}. */
74
+ /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {tmp0,tmp1}. */
75
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
76
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
77
QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
78
QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
79
- tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
80
+ tcg_out_insn(s, 3314, LDP, TCG_REG_TMP0, TCG_REG_TMP1, TCG_AREG0,
81
TLB_MASK_TABLE_OFS(mem_index), 1, 0);
82
83
/* Extract the TLB index from the address into X0. */
84
tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
85
- TCG_REG_X0, TCG_REG_X0, addr_reg,
86
+ TCG_REG_TMP0, TCG_REG_TMP0, addr_reg,
87
s->page_bits - CPU_TLB_ENTRY_BITS);
88
89
- /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1. */
90
- tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
91
+ /* Add the tlb_table pointer, forming the CPUTLBEntry address in TMP1. */
92
+ tcg_out_insn(s, 3502, ADD, 1, TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_TMP0);
93
94
- /* Load the tlb comparator into X0, and the fast path addend into X1. */
95
- tcg_out_ld(s, addr_type, TCG_REG_X0, TCG_REG_X1,
96
+ /* Load the tlb comparator into TMP0, and the fast path addend into TMP1. */
97
+ tcg_out_ld(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP1,
98
is_ld ? offsetof(CPUTLBEntry, addr_read)
99
: offsetof(CPUTLBEntry, addr_write));
100
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
101
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
102
offsetof(CPUTLBEntry, addend));
103
104
/*
105
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
106
* cross pages using the address of the last byte of the access.
107
*/
108
if (a_mask >= s_mask) {
109
- x3 = addr_reg;
110
+ addr_adj = addr_reg;
111
} else {
112
+ addr_adj = TCG_REG_TMP2;
113
tcg_out_insn(s, 3401, ADDI, addr_type,
114
- TCG_REG_X3, addr_reg, s_mask - a_mask);
115
- x3 = TCG_REG_X3;
116
+ addr_adj, addr_reg, s_mask - a_mask);
117
}
118
compare_mask = (uint64_t)s->page_mask | a_mask;
119
120
- /* Store the page mask part of the address into X3. */
121
- tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_X3, x3, compare_mask);
122
+ /* Store the page mask part of the address into TMP2. */
123
+ tcg_out_logicali(s, I3404_ANDI, addr_type, TCG_REG_TMP2,
124
+ addr_adj, compare_mask);
125
126
/* Perform the address comparison. */
127
- tcg_out_cmp(s, addr_type, TCG_REG_X0, TCG_REG_X3, 0);
128
+ tcg_out_cmp(s, addr_type, TCG_REG_TMP0, TCG_REG_TMP2, 0);
129
130
/* If not equal, we jump to the slow path. */
131
ldst->label_ptr[0] = s->code_ptr;
132
tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
133
134
- h->base = TCG_REG_X1,
135
+ h->base = TCG_REG_TMP1;
136
h->index = addr_reg;
137
h->index_ext = addr_type;
138
#else
64
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
139
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
65
140
case INDEX_op_qemu_ld_a64_i32:
66
case INDEX_op_qemu_ld_i32:
141
case INDEX_op_qemu_ld_a32_i64:
67
case INDEX_op_qemu_ld_i64:
142
case INDEX_op_qemu_ld_a64_i64:
68
- return C_O1_I1(r, L);
143
- return C_O1_I1(r, l);
69
+ return C_O1_I1(r, r);
144
+ return C_O1_I1(r, r);
70
case INDEX_op_qemu_st_i32:
145
case INDEX_op_qemu_st_a32_i32:
71
case INDEX_op_qemu_st_i64:
146
case INDEX_op_qemu_st_a64_i32:
72
- return C_O0_I2(LZ, L);
147
case INDEX_op_qemu_st_a32_i64:
148
case INDEX_op_qemu_st_a64_i64:
149
- return C_O0_I2(lZ, l);
73
+ return C_O0_I2(rZ, r);
150
+ return C_O0_I2(rZ, r);
74
151
75
default:
152
case INDEX_op_deposit_i32:
76
g_assert_not_reached();
153
case INDEX_op_deposit_i64:
77
--
154
--
78
2.34.1
155
2.34.1
79
80
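The TLB fast path rewritten in the patch above boils down to: load {mask,table} for the mmu index with a single LDP, shift the guest address down so the page number lands on the entry stride, AND with the pre-scaled mask, and add the table base to obtain the CPUTLBEntry to compare. In C terms, with illustrative names standing in for the CPUTLBDescFast fields:

    #include <stdint.h>

    #define CPU_TLB_ENTRY_BITS 5   /* log2(sizeof(CPUTLBEntry)); illustrative
                                      value for a 64-bit host */

    /* mask is already scaled by the entry size, so the AND yields a byte
       offset into the dynamically sized TLB table. */
    static uintptr_t tlb_entry_addr(uint64_t addr, unsigned page_bits,
                                    uintptr_t mask, uintptr_t table)
    {
        uintptr_t offset = (addr >> (page_bits - CPU_TLB_ENTRY_BITS)) & mask;
        return table + offset;
    }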
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
1
With FEAT_LSE2, LDP/STP suffices. Without FEAT_LSE2, use LDXP+STXP
2
and some code that lived in both tcg_out_qemu_ld and tcg_out_qemu_st
2
when 16-byte atomicity is required, and LDP/STP otherwise.
3
into one function that returns HostAddress and TCGLabelQemuLdst structures.
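For the non-FEAT_LSE2 case mentioned above, the backend emits an LDXP/STXP retry loop so that the 16-byte access is single-copy atomic. A hedged host-side sketch of the equivalent load, assuming an AArch64 host, GCC/Clang inline asm, and a 16-byte-aligned, writable address (the store-exclusive writes the same bytes back, which is why writable pages are required); the function name is illustrative:

    #include <stdint.h>

    /* 16-byte atomic load without FEAT_LSE2: load-exclusive the pair, then
       store-exclusive the same pair back; retry until the monitor succeeds. */
    static inline void atomic16_load_ldxp(void *addr, uint64_t out[2])
    {
        uint64_t lo, hi;
        uint32_t fail;

        do {
            asm volatile("ldxp %0, %1, [%3]\n\t"
                         "stxp %w2, %0, %1, [%3]"
                         : "=&r"(lo), "=&r"(hi), "=&r"(fail)
                         : "r"(addr)
                         : "memory");
        } while (fail);

        out[0] = lo;    /* low 64 bits, little-endian host order */
        out[1] = hi;    /* high 64 bits */
    }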
4
3
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
4
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
6
---
8
tcg/mips/tcg-target.c.inc | 404 ++++++++++++++++----------------------
7
tcg/aarch64/tcg-target-con-set.h | 2 +
9
1 file changed, 172 insertions(+), 232 deletions(-)
8
tcg/aarch64/tcg-target.h | 11 ++-
9
tcg/aarch64/tcg-target.c.inc | 141 ++++++++++++++++++++++++++++++-
10
3 files changed, 151 insertions(+), 3 deletions(-)
10
11
11
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
12
diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
12
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/mips/tcg-target.c.inc
14
--- a/tcg/aarch64/tcg-target-con-set.h
14
+++ b/tcg/mips/tcg-target.c.inc
15
+++ b/tcg/aarch64/tcg-target-con-set.h
15
@@ -XXX,XX +XXX,XX @@ static int tcg_out_call_iarg_reg2(TCGContext *s, int i, TCGReg al, TCGReg ah)
16
@@ -XXX,XX +XXX,XX @@ C_O0_I1(r)
16
return i;
17
C_O0_I2(r, rA)
18
C_O0_I2(rZ, r)
19
C_O0_I2(w, r)
20
+C_O0_I3(rZ, rZ, r)
21
C_O1_I1(r, r)
22
C_O1_I1(w, r)
23
C_O1_I1(w, w)
24
@@ -XXX,XX +XXX,XX @@ C_O1_I2(w, w, wO)
25
C_O1_I2(w, w, wZ)
26
C_O1_I3(w, w, w, w)
27
C_O1_I4(r, r, rA, rZ, rZ)
28
+C_O2_I1(r, r, r)
29
C_O2_I4(r, r, rZ, rZ, rA, rMZ)
30
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
31
index XXXXXXX..XXXXXXX 100644
32
--- a/tcg/aarch64/tcg-target.h
33
+++ b/tcg/aarch64/tcg-target.h
34
@@ -XXX,XX +XXX,XX @@ typedef enum {
35
#define TCG_TARGET_HAS_muluh_i64 1
36
#define TCG_TARGET_HAS_mulsh_i64 1
37
38
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
39
+/*
40
+ * Without FEAT_LSE2, we must use LDXP+STXP to implement atomic 128-bit load,
41
+ * which requires writable pages. We must defer to the helper for user-only,
42
+ * but in system mode all ram is writable for the host.
43
+ */
44
+#ifdef CONFIG_USER_ONLY
45
+#define TCG_TARGET_HAS_qemu_ldst_i128 have_lse2
46
+#else
47
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
48
+#endif
49
50
#define TCG_TARGET_HAS_v64 1
51
#define TCG_TARGET_HAS_v128 1
52
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
53
index XXXXXXX..XXXXXXX 100644
54
--- a/tcg/aarch64/tcg-target.c.inc
55
+++ b/tcg/aarch64/tcg-target.c.inc
56
@@ -XXX,XX +XXX,XX @@ typedef enum {
57
I3305_LDR_v64 = 0x5c000000,
58
I3305_LDR_v128 = 0x9c000000,
59
60
+ /* Load/store exclusive. */
61
+ I3306_LDXP = 0xc8600000,
62
+ I3306_STXP = 0xc8200000,
63
+
64
/* Load/store register. Described here as 3.3.12, but the helper
65
that emits them can transform to 3.3.10 or 3.3.13. */
66
I3312_STRB = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
67
@@ -XXX,XX +XXX,XX @@ typedef enum {
68
I3406_ADR = 0x10000000,
69
I3406_ADRP = 0x90000000,
70
71
+ /* Add/subtract extended register instructions. */
72
+ I3501_ADD = 0x0b200000,
73
+
74
/* Add/subtract shifted register instructions (without a shift). */
75
I3502_ADD = 0x0b000000,
76
I3502_ADDS = 0x2b000000,
77
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn,
78
tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
17
}
79
}
18
80
19
-/* We expect to use a 16-bit negative offset from ENV. */
81
+static void tcg_out_insn_3306(TCGContext *s, AArch64Insn insn, TCGReg rs,
20
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
82
+ TCGReg rt, TCGReg rt2, TCGReg rn)
21
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -32768);
83
+{
22
-
84
+ tcg_out32(s, insn | rs << 16 | rt2 << 10 | rn << 5 | rt);
23
-/*
85
+}
24
- * Perform the tlb comparison operation.
86
+
25
- * The complete host address is placed in BASE.
87
static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
26
- * Clobbers TMP0, TMP1, TMP2, TMP3.
88
TCGReg rt, int imm19)
27
- */
89
{
28
-static void tcg_out_tlb_load(TCGContext *s, TCGReg base, TCGReg addrl,
90
@@ -XXX,XX +XXX,XX @@ static void tcg_out_insn_3406(TCGContext *s, AArch64Insn insn,
29
- TCGReg addrh, MemOpIdx oi,
91
tcg_out32(s, insn | (disp & 3) << 29 | (disp & 0x1ffffc) << (5 - 2) | rd);
30
- tcg_insn_unit *label_ptr[2], bool is_load)
92
}
31
-{
93
32
- MemOp opc = get_memop(oi);
94
+static inline void tcg_out_insn_3501(TCGContext *s, AArch64Insn insn,
33
- unsigned a_bits = get_alignment_bits(opc);
95
+ TCGType sf, TCGReg rd, TCGReg rn,
96
+ TCGReg rm, int opt, int imm3)
97
+{
98
+ tcg_out32(s, insn | sf << 31 | rm << 16 | opt << 13 |
99
+ imm3 << 10 | rn << 5 | rd);
100
+}
101
+
102
/* This function is for both 3.5.2 (Add/Subtract shifted register), for
103
the rare occasion when we actually want to supply a shift amount. */
104
static inline void tcg_out_insn_3502S(TCGContext *s, AArch64Insn insn,
105
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
106
TCGType addr_type = s->addr_type;
107
TCGLabelQemuLdst *ldst = NULL;
108
MemOp opc = get_memop(oi);
109
+ MemOp s_bits = opc & MO_SIZE;
110
unsigned a_mask;
111
112
h->aa = atom_and_align_for_opc(s, opc,
113
have_lse2 ? MO_ATOM_WITHIN16
114
: MO_ATOM_IFALIGN,
115
- false);
116
+ s_bits == MO_128);
117
a_mask = (1 << h->aa.align) - 1;
118
119
#ifdef CONFIG_SOFTMMU
34
- unsigned s_bits = opc & MO_SIZE;
120
- unsigned s_bits = opc & MO_SIZE;
35
- unsigned a_mask = (1 << a_bits) - 1;
121
unsigned s_mask = (1u << s_bits) - 1;
36
- unsigned s_mask = (1 << s_bits) - 1;
122
unsigned mem_index = get_mmuidx(oi);
37
- int mem_index = get_mmuidx(oi);
123
TCGReg addr_adj;
38
- int fast_off = TLB_MASK_TABLE_OFS(mem_index);
124
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
39
- int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
125
}
40
- int table_off = fast_off + offsetof(CPUTLBDescFast, table);
41
- int add_off = offsetof(CPUTLBEntry, addend);
42
- int cmp_off = (is_load ? offsetof(CPUTLBEntry, addr_read)
43
- : offsetof(CPUTLBEntry, addr_write));
44
- target_ulong tlb_mask;
45
-
46
- /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
47
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP0, TCG_AREG0, mask_off);
48
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP1, TCG_AREG0, table_off);
49
-
50
- /* Extract the TLB index from the address into TMP3. */
51
- tcg_out_opc_sa(s, ALIAS_TSRL, TCG_TMP3, addrl,
52
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
53
- tcg_out_opc_reg(s, OPC_AND, TCG_TMP3, TCG_TMP3, TCG_TMP0);
54
-
55
- /* Add the tlb_table pointer, creating the CPUTLBEntry address in TMP3. */
56
- tcg_out_opc_reg(s, ALIAS_PADD, TCG_TMP3, TCG_TMP3, TCG_TMP1);
57
-
58
- /* Load the (low-half) tlb comparator. */
59
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
60
- tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + LO_OFF);
61
- } else {
62
- tcg_out_ldst(s, (TARGET_LONG_BITS == 64 ? OPC_LD
63
- : TCG_TARGET_REG_BITS == 64 ? OPC_LWU : OPC_LW),
64
- TCG_TMP0, TCG_TMP3, cmp_off);
65
- }
66
-
67
- /* Zero extend a 32-bit guest address for a 64-bit host. */
68
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
69
- tcg_out_ext32u(s, base, addrl);
70
- addrl = base;
71
- }
72
-
73
- /*
74
- * Mask the page bits, keeping the alignment bits to compare against.
75
- * For unaligned accesses, compare against the end of the access to
76
- * verify that it does not cross a page boundary.
77
- */
78
- tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
79
- tcg_out_movi(s, TCG_TYPE_I32, TCG_TMP1, tlb_mask);
80
- if (a_mask >= s_mask) {
81
- tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrl);
82
- } else {
83
- tcg_out_opc_imm(s, ALIAS_PADDI, TCG_TMP2, addrl, s_mask - a_mask);
84
- tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, TCG_TMP2);
85
- }
86
-
87
- if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
88
- /* Load the tlb addend for the fast path. */
89
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
90
- }
91
-
92
- label_ptr[0] = s->code_ptr;
93
- tcg_out_opc_br(s, OPC_BNE, TCG_TMP1, TCG_TMP0);
94
-
95
- /* Load and test the high half tlb comparator. */
96
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
97
- /* delay slot */
98
- tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + HI_OFF);
99
-
100
- /* Load the tlb addend for the fast path. */
101
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
102
-
103
- label_ptr[1] = s->code_ptr;
104
- tcg_out_opc_br(s, OPC_BNE, addrh, TCG_TMP0);
105
- }
106
-
107
- /* delay slot */
108
- tcg_out_opc_reg(s, ALIAS_PADD, base, TCG_TMP2, addrl);
109
-}
110
-
111
-static void add_qemu_ldst_label(TCGContext *s, int is_ld, MemOpIdx oi,
112
- TCGType ext,
113
- TCGReg datalo, TCGReg datahi,
114
- TCGReg addrlo, TCGReg addrhi,
115
- void *raddr, tcg_insn_unit *label_ptr[2])
116
-{
117
- TCGLabelQemuLdst *label = new_ldst_label(s);
118
-
119
- label->is_ld = is_ld;
120
- label->oi = oi;
121
- label->type = ext;
122
- label->datalo_reg = datalo;
123
- label->datahi_reg = datahi;
124
- label->addrlo_reg = addrlo;
125
- label->addrhi_reg = addrhi;
126
- label->raddr = tcg_splitwx_to_rx(raddr);
127
- label->label_ptr[0] = label_ptr[0];
128
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
129
- label->label_ptr[1] = label_ptr[1];
130
- }
131
-}
132
-
133
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
134
{
135
const tcg_insn_unit *tgt_rx = tcg_splitwx_to_rx(s->code_ptr);
136
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
137
}
126
}
138
127
139
#else
128
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
140
-
129
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
141
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
142
- TCGReg addrhi, unsigned a_bits)
143
-{
144
- unsigned a_mask = (1 << a_bits) - 1;
145
- TCGLabelQemuLdst *l = new_ldst_label(s);
146
-
147
- l->is_ld = is_ld;
148
- l->addrlo_reg = addrlo;
149
- l->addrhi_reg = addrhi;
150
-
151
- /* We are expecting a_bits to max out at 7, much lower than ANDI. */
152
- tcg_debug_assert(a_bits < 16);
153
- tcg_out_opc_imm(s, OPC_ANDI, TCG_TMP0, addrlo, a_mask);
154
-
155
- l->label_ptr[0] = s->code_ptr;
156
- if (use_mips32r6_instructions) {
157
- tcg_out_opc_br(s, OPC_BNEZALC_R6, TCG_REG_ZERO, TCG_TMP0);
158
- } else {
159
- tcg_out_opc_br(s, OPC_BNEL, TCG_TMP0, TCG_REG_ZERO);
160
- tcg_out_nop(s);
161
- }
162
-
163
- l->raddr = tcg_splitwx_to_rx(s->code_ptr);
164
-}
165
-
166
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
167
{
168
void *target;
169
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
170
}
171
#endif /* SOFTMMU */
172
173
+typedef struct {
174
+ TCGReg base;
175
+ MemOp align;
176
+} HostAddress;
177
+
178
+/*
179
+ * For softmmu, perform the TLB load and compare.
180
+ * For useronly, perform any required alignment tests.
181
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
182
+ * is required and fill in @h with the host address for the fast path.
183
+ */
184
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
185
+ TCGReg addrlo, TCGReg addrhi,
186
+ MemOpIdx oi, bool is_ld)
187
+{
130
+{
188
+ TCGLabelQemuLdst *ldst = NULL;
189
+ MemOp opc = get_memop(oi);
190
+ unsigned a_bits = get_alignment_bits(opc);
191
+ unsigned s_bits = opc & MO_SIZE;
192
+ unsigned a_mask = (1 << a_bits) - 1;
193
+ TCGReg base;
194
+
195
+#ifdef CONFIG_SOFTMMU
196
+ unsigned s_mask = (1 << s_bits) - 1;
197
+ int mem_index = get_mmuidx(oi);
198
+ int fast_off = TLB_MASK_TABLE_OFS(mem_index);
199
+ int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
200
+ int table_off = fast_off + offsetof(CPUTLBDescFast, table);
201
+ int add_off = offsetof(CPUTLBEntry, addend);
202
+ int cmp_off = is_ld ? offsetof(CPUTLBEntry, addr_read)
203
+ : offsetof(CPUTLBEntry, addr_write);
204
+ target_ulong tlb_mask;
205
+
206
+ ldst = new_ldst_label(s);
207
+ ldst->is_ld = is_ld;
208
+ ldst->oi = oi;
209
+ ldst->addrlo_reg = addrlo;
210
+ ldst->addrhi_reg = addrhi;
211
+ base = TCG_REG_A0;
212
+
213
+ /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
214
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
215
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -32768);
216
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP0, TCG_AREG0, mask_off);
217
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP1, TCG_AREG0, table_off);
218
+
219
+ /* Extract the TLB index from the address into TMP3. */
220
+ tcg_out_opc_sa(s, ALIAS_TSRL, TCG_TMP3, addrlo,
221
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
222
+ tcg_out_opc_reg(s, OPC_AND, TCG_TMP3, TCG_TMP3, TCG_TMP0);
223
+
224
+ /* Add the tlb_table pointer, creating the CPUTLBEntry address in TMP3. */
225
+ tcg_out_opc_reg(s, ALIAS_PADD, TCG_TMP3, TCG_TMP3, TCG_TMP1);
226
+
227
+ /* Load the (low-half) tlb comparator. */
228
+ if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
229
+ tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + LO_OFF);
230
+ } else {
231
+ tcg_out_ldst(s, (TARGET_LONG_BITS == 64 ? OPC_LD
232
+ : TCG_TARGET_REG_BITS == 64 ? OPC_LWU : OPC_LW),
233
+ TCG_TMP0, TCG_TMP3, cmp_off);
234
+ }
235
+
236
+ /* Zero extend a 32-bit guest address for a 64-bit host. */
237
+ if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
238
+ tcg_out_ext32u(s, base, addrlo);
239
+ addrlo = base;
240
+ }
241
+
242
+ /*
243
+ * Mask the page bits, keeping the alignment bits to compare against.
244
+ * For unaligned accesses, compare against the end of the access to
245
+ * verify that it does not cross a page boundary.
246
+ */
247
+ tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
248
+ tcg_out_movi(s, TCG_TYPE_I32, TCG_TMP1, tlb_mask);
249
+ if (a_mask >= s_mask) {
250
+ tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrlo);
251
+ } else {
252
+ tcg_out_opc_imm(s, ALIAS_PADDI, TCG_TMP2, addrlo, s_mask - a_mask);
253
+ tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, TCG_TMP2);
254
+ }
255
+
256
+ if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
257
+ /* Load the tlb addend for the fast path. */
258
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
259
+ }
260
+
261
+ ldst->label_ptr[0] = s->code_ptr;
262
+ tcg_out_opc_br(s, OPC_BNE, TCG_TMP1, TCG_TMP0);
263
+
264
+ /* Load and test the high half tlb comparator. */
265
+ if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
266
+ /* delay slot */
267
+ tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + HI_OFF);
268
+
269
+ /* Load the tlb addend for the fast path. */
270
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
271
+
272
+ ldst->label_ptr[1] = s->code_ptr;
273
+ tcg_out_opc_br(s, OPC_BNE, addrhi, TCG_TMP0);
274
+ }
275
+
276
+ /* delay slot */
277
+ tcg_out_opc_reg(s, ALIAS_PADD, base, TCG_TMP2, addrlo);
278
+#else
279
+ if (a_mask && (use_mips32r6_instructions || a_bits != s_bits)) {
280
+ ldst = new_ldst_label(s);
281
+
282
+ ldst->is_ld = is_ld;
283
+ ldst->oi = oi;
284
+ ldst->addrlo_reg = addrlo;
285
+ ldst->addrhi_reg = addrhi;
286
+
287
+ /* We are expecting a_bits to max out at 7, much lower than ANDI. */
288
+ tcg_debug_assert(a_bits < 16);
289
+ tcg_out_opc_imm(s, OPC_ANDI, TCG_TMP0, addrlo, a_mask);
290
+
291
+ ldst->label_ptr[0] = s->code_ptr;
292
+ if (use_mips32r6_instructions) {
293
+ tcg_out_opc_br(s, OPC_BNEZALC_R6, TCG_REG_ZERO, TCG_TMP0);
294
+ } else {
295
+ tcg_out_opc_br(s, OPC_BNEL, TCG_TMP0, TCG_REG_ZERO);
296
+ tcg_out_nop(s);
297
+ }
298
+ }
299
+
300
+ base = addrlo;
301
+ if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
302
+ tcg_out_ext32u(s, TCG_REG_A0, base);
303
+ base = TCG_REG_A0;
304
+ }
305
+ if (guest_base) {
306
+ if (guest_base == (int16_t)guest_base) {
307
+ tcg_out_opc_imm(s, ALIAS_PADDI, TCG_REG_A0, base, guest_base);
308
+ } else {
309
+ tcg_out_opc_reg(s, ALIAS_PADD, TCG_REG_A0, base,
310
+ TCG_GUEST_BASE_REG);
311
+ }
312
+ base = TCG_REG_A0;
313
+ }
314
+#endif
315
+
316
+ h->base = base;
317
+ h->align = a_bits;
318
+ return ldst;
319
+}
320
+
321
static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg lo, TCGReg hi,
322
TCGReg base, MemOp opc, TCGType type)
323
{
324
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
325
MemOpIdx oi, TCGType data_type)
326
{
327
MemOp opc = get_memop(oi);
328
- unsigned a_bits = get_alignment_bits(opc);
329
- unsigned s_bits = opc & MO_SIZE;
330
- TCGReg base;
331
+ TCGLabelQemuLdst *ldst;
131
+ TCGLabelQemuLdst *ldst;
332
+ HostAddress h;
132
+ HostAddress h;
333
133
+ TCGReg base;
334
- /*
134
+ bool use_pair;
335
- * R6 removes the left/right instructions but requires the
135
+
336
- * system to support misaligned memory accesses.
136
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
337
- */
137
+
338
-#if defined(CONFIG_SOFTMMU)
138
+ /* Compose the final address, as LDP/STP have no indexing. */
339
- tcg_insn_unit *label_ptr[2];
139
+ if (h.index == TCG_REG_XZR) {
340
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, true);
140
+ base = h.base;
341
141
+ } else {
342
- base = TCG_REG_A0;
142
+ base = TCG_REG_TMP2;
343
- tcg_out_tlb_load(s, base, addrlo, addrhi, oi, label_ptr, 1);
143
+ if (h.index_ext == TCG_TYPE_I32) {
344
- if (use_mips32r6_instructions || a_bits >= s_bits) {
144
+ /* add base, base, index, uxtw */
345
- tcg_out_qemu_ld_direct(s, datalo, datahi, base, opc, data_type);
145
+ tcg_out_insn(s, 3501, ADD, TCG_TYPE_I64, base,
346
+ if (use_mips32r6_instructions || h.align >= (opc & MO_SIZE)) {
146
+ h.base, h.index, MO_32, 0);
347
+ tcg_out_qemu_ld_direct(s, datalo, datahi, h.base, opc, data_type);
147
+ } else {
348
} else {
148
+ /* add base, base, index */
349
- tcg_out_qemu_ld_unalign(s, datalo, datahi, base, opc, data_type);
149
+ tcg_out_insn(s, 3502, ADD, 1, base, h.base, h.index);
350
+ tcg_out_qemu_ld_unalign(s, datalo, datahi, h.base, opc, data_type);
150
+ }
351
}
151
+ }
352
- add_qemu_ldst_label(s, true, oi, data_type, datalo, datahi,
152
+
353
- addrlo, addrhi, s->code_ptr, label_ptr);
153
+ use_pair = h.aa.atom < MO_128 || have_lse2;
354
-#else
154
+
355
- base = addrlo;
155
+ if (!use_pair) {
356
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
156
+ tcg_insn_unit *branch = NULL;
357
- tcg_out_ext32u(s, TCG_REG_A0, base);
157
+ TCGReg ll, lh, sl, sh;
358
- base = TCG_REG_A0;
158
+
159
+ /*
160
+ * If we have already checked for 16-byte alignment, that's all
161
+ * we need. Otherwise we have determined that misaligned atomicity
162
+ * may be handled with two 8-byte loads.
163
+ */
164
+ if (h.aa.align < MO_128) {
165
+ /*
166
+ * TODO: align should be MO_64, so we only need test bit 3,
167
+ * which means we could use TBNZ instead of ANDS+B_C.
168
+ */
169
+ tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, 15);
170
+ branch = s->code_ptr;
171
+ tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
172
+ use_pair = true;
173
+ }
174
+
175
+ if (is_ld) {
176
+ /*
177
+ * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
178
+ * ldxp lo, hi, [base]
179
+ * stxp t0, lo, hi, [base]
180
+ * cbnz t0, .-8
181
+ * Require no overlap between data{lo,hi} and base.
182
+ */
183
+ if (datalo == base || datahi == base) {
184
+ tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_TMP2, base);
185
+ base = TCG_REG_TMP2;
186
+ }
187
+ ll = sl = datalo;
188
+ lh = sh = datahi;
189
+ } else {
190
+ /*
191
+ * 16-byte atomicity without LSE2 requires LDXP+STXP loop:
192
+ * 1: ldxp t0, t1, [base]
193
+ * stxp t0, lo, hi, [base]
194
+ * cbnz t0, 1b
195
+ */
196
+ tcg_debug_assert(base != TCG_REG_TMP0 && base != TCG_REG_TMP1);
197
+ ll = TCG_REG_TMP0;
198
+ lh = TCG_REG_TMP1;
199
+ sl = datalo;
200
+ sh = datahi;
201
+ }
202
+
203
+ tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR, ll, lh, base);
204
+ tcg_out_insn(s, 3306, STXP, TCG_REG_TMP0, sl, sh, base);
205
+ tcg_out_insn(s, 3201, CBNZ, 0, TCG_REG_TMP0, -2);
206
+
207
+ if (use_pair) {
208
+ /* "b .+8", branching across the one insn of use_pair. */
209
+ tcg_out_insn(s, 3206, B, 2);
210
+ reloc_pc19(branch, tcg_splitwx_to_rx(s->code_ptr));
211
+ }
212
+ }
213
+
214
+ if (use_pair) {
215
+ if (is_ld) {
216
+ tcg_out_insn(s, 3314, LDP, datalo, datahi, base, 0, 1, 0);
217
+ } else {
218
+ tcg_out_insn(s, 3314, STP, datalo, datahi, base, 0, 1, 0);
219
+ }
220
+ }
359
+
221
+
360
+ if (ldst) {
222
+ if (ldst) {
361
+ ldst->type = data_type;
223
+ ldst->type = TCG_TYPE_I128;
362
+ ldst->datalo_reg = datalo;
224
+ ldst->datalo_reg = datalo;
363
+ ldst->datahi_reg = datahi;
225
+ ldst->datahi_reg = datahi;
364
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
226
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
365
}
227
+ }
366
- if (guest_base) {
228
+}
367
- if (guest_base == (int16_t)guest_base) {
229
+
368
- tcg_out_opc_imm(s, ALIAS_PADDI, TCG_REG_A0, base, guest_base);
230
static const tcg_insn_unit *tb_ret_addr;
369
- } else {
231
370
- tcg_out_opc_reg(s, ALIAS_PADD, TCG_REG_A0, base,
232
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
371
- TCG_GUEST_BASE_REG);
233
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
372
- }
234
case INDEX_op_qemu_st_a64_i64:
373
- base = TCG_REG_A0;
235
tcg_out_qemu_st(s, REG0(0), a1, a2, ext);
374
- }
236
break;
375
- if (use_mips32r6_instructions) {
237
+ case INDEX_op_qemu_ld_a32_i128:
376
- if (a_bits) {
238
+ case INDEX_op_qemu_ld_a64_i128:
377
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
239
+ tcg_out_qemu_ldst_i128(s, a0, a1, a2, args[3], true);
378
- }
240
+ break;
379
- tcg_out_qemu_ld_direct(s, datalo, datahi, base, opc, data_type);
241
+ case INDEX_op_qemu_st_a32_i128:
380
- } else {
242
+ case INDEX_op_qemu_st_a64_i128:
381
- if (a_bits && a_bits != s_bits) {
243
+ tcg_out_qemu_ldst_i128(s, REG0(0), REG0(1), a2, args[3], false);
382
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
244
+ break;
383
- }
245
384
- if (a_bits >= s_bits) {
246
case INDEX_op_bswap64_i64:
385
- tcg_out_qemu_ld_direct(s, datalo, datahi, base, opc, data_type);
247
tcg_out_rev(s, TCG_TYPE_I64, MO_64, a0, a1);
386
- } else {
248
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
387
- tcg_out_qemu_ld_unalign(s, datalo, datahi, base, opc, data_type);
249
case INDEX_op_qemu_ld_a32_i64:
388
- }
250
case INDEX_op_qemu_ld_a64_i64:
389
- }
251
return C_O1_I1(r, r);
390
-#endif
252
+ case INDEX_op_qemu_ld_a32_i128:
391
}
253
+ case INDEX_op_qemu_ld_a64_i128:
392
254
+ return C_O2_I1(r, r, r);
393
static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg lo, TCGReg hi,
255
case INDEX_op_qemu_st_a32_i32:
394
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
256
case INDEX_op_qemu_st_a64_i32:
395
MemOpIdx oi, TCGType data_type)
257
case INDEX_op_qemu_st_a32_i64:
396
{
258
case INDEX_op_qemu_st_a64_i64:
397
MemOp opc = get_memop(oi);
259
return C_O0_I2(rZ, r);
398
- unsigned a_bits = get_alignment_bits(opc);
260
+ case INDEX_op_qemu_st_a32_i128:
399
- unsigned s_bits = opc & MO_SIZE;
261
+ case INDEX_op_qemu_st_a64_i128:
400
- TCGReg base;
262
+ return C_O0_I3(rZ, rZ, r);
401
+ TCGLabelQemuLdst *ldst;
263
402
+ HostAddress h;
264
case INDEX_op_deposit_i32:
403
265
case INDEX_op_deposit_i64:
404
- /*
405
- * R6 removes the left/right instructions but requires the
406
- * system to support misaligned memory accesses.
407
- */
408
-#if defined(CONFIG_SOFTMMU)
409
- tcg_insn_unit *label_ptr[2];
410
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, false);
411
412
- base = TCG_REG_A0;
413
- tcg_out_tlb_load(s, base, addrlo, addrhi, oi, label_ptr, 0);
414
- if (use_mips32r6_instructions || a_bits >= s_bits) {
415
- tcg_out_qemu_st_direct(s, datalo, datahi, base, opc);
416
+ if (use_mips32r6_instructions || h.align >= (opc & MO_SIZE)) {
417
+ tcg_out_qemu_st_direct(s, datalo, datahi, h.base, opc);
418
} else {
419
- tcg_out_qemu_st_unalign(s, datalo, datahi, base, opc);
420
+ tcg_out_qemu_st_unalign(s, datalo, datahi, h.base, opc);
421
}
422
- add_qemu_ldst_label(s, false, oi, data_type, datalo, datahi,
423
- addrlo, addrhi, s->code_ptr, label_ptr);
424
-#else
425
- base = addrlo;
426
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
427
- tcg_out_ext32u(s, TCG_REG_A0, base);
428
- base = TCG_REG_A0;
429
+
430
+ if (ldst) {
431
+ ldst->type = data_type;
432
+ ldst->datalo_reg = datalo;
433
+ ldst->datahi_reg = datahi;
434
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
435
}
436
- if (guest_base) {
437
- if (guest_base == (int16_t)guest_base) {
438
- tcg_out_opc_imm(s, ALIAS_PADDI, TCG_REG_A0, base, guest_base);
439
- } else {
440
- tcg_out_opc_reg(s, ALIAS_PADD, TCG_REG_A0, base,
441
- TCG_GUEST_BASE_REG);
442
- }
443
- base = TCG_REG_A0;
444
- }
445
- if (use_mips32r6_instructions) {
446
- if (a_bits) {
447
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
448
- }
449
- tcg_out_qemu_st_direct(s, datalo, datahi, base, opc);
450
- } else {
451
- if (a_bits && a_bits != s_bits) {
452
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
453
- }
454
- if (a_bits >= s_bits) {
455
- tcg_out_qemu_st_direct(s, datalo, datahi, base, opc);
456
- } else {
457
- tcg_out_qemu_st_unalign(s, datalo, datahi, base, opc);
458
- }
459
- }
460
-#endif
461
}
462
463
static void tcg_out_mb(TCGContext *s, TCGArg a0)
464
--
2.34.1
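
The ldxp/stxp loop emitted in the aarch64 hunks above behaves like a
16-byte compare-and-swap that writes back the value it just read, which
is also why the data registers must not overlap the base register and
why the page must be writable. A rough user-space sketch of the same
idea, using the GCC/Clang __atomic builtins on unsigned __int128
(illustration only, not backend code; it assumes the host provides
16-byte atomics, e.g. via -mcx16 or libatomic):

    typedef unsigned __int128 u128;

    /* Emulate an atomic 16-byte load with a CAS that stores back the
     * value it observed, mirroring ldxp/stxp; requires write access. */
    static inline u128 atomic16_read_rw(u128 *p)
    {
        u128 expected = 0;

        /* On mismatch the CAS copies the current contents of *p into
         * 'expected'; on match it stores 0 over an existing 0, which is
         * a no-op. Either way 'expected' ends up holding *p. */
        __atomic_compare_exchange_n(p, &expected, expected, false,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        return expected;
    }
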
The softmmu tlb uses TCG_REG_{TMP1,TMP2,R0}, not any of the normally
available registers. Now that we handle overlap between inputs and
helper arguments, we can allow any allocatable reg.

Use LQ/STQ with ISA v2.07, and 16-byte atomicity is required.
Note that these instructions do not require 16-byte alignment.
4
3
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
4
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
6
---
9
tcg/ppc/tcg-target-con-set.h | 11 ++++-------
7
tcg/ppc/tcg-target-con-set.h | 2 +
10
tcg/ppc/tcg-target-con-str.h | 2 --
8
tcg/ppc/tcg-target-con-str.h | 1 +
11
tcg/ppc/tcg-target.c.inc | 32 ++++++++++----------------------
9
tcg/ppc/tcg-target.h | 3 +-
12
3 files changed, 14 insertions(+), 31 deletions(-)
10
tcg/ppc/tcg-target.c.inc | 108 +++++++++++++++++++++++++++++++----
11
4 files changed, 101 insertions(+), 13 deletions(-)
13
12
14
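
As a hedged illustration of the choice the ppc backend makes below,
where a single 16-byte single-copy-atomic access (LQ/STQ, available
from ISA v2.07) is used when 16-byte atomicity is demanded, and a pair
of 8-byte accesses (LD/STD, or LDBRX/STDBRX when byte-swapping) is used
otherwise, here is the same decision written as plain C. The names u128
and load16 are invented for this sketch, and the 16-byte __atomic_load_n
may need libatomic on some hosts:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    typedef unsigned __int128 u128;

    static u128 load16(void *p, bool need_16byte_atomicity)
    {
        u128 val;

        if (need_16byte_atomicity) {
            /* One single-copy-atomic 16-byte load (what LQ provides). */
            val = __atomic_load_n((u128 *)p, __ATOMIC_RELAXED);
        } else {
            /* Two 8-byte loads, recombined in memory order so the byte
             * layout matches the 16-byte load on either endianness. */
            const uint64_t *q = p;
            uint64_t half[2] = { q[0], q[1] };
            memcpy(&val, half, sizeof(val));
        }
        return val;
    }
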
diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
13
diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
15
index XXXXXXX..XXXXXXX 100644
14
index XXXXXXX..XXXXXXX 100644
16
--- a/tcg/ppc/tcg-target-con-set.h
15
--- a/tcg/ppc/tcg-target-con-set.h
17
+++ b/tcg/ppc/tcg-target-con-set.h
16
+++ b/tcg/ppc/tcg-target-con-set.h
18
@@ -XXX,XX +XXX,XX @@
17
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
19
C_O0_I1(r)
20
C_O0_I2(r, r)
21
C_O0_I2(r, ri)
18
C_O0_I2(r, ri)
22
-C_O0_I2(S, S)
23
C_O0_I2(v, r)
19
C_O0_I2(v, r)
24
-C_O0_I3(S, S, S)
20
C_O0_I3(r, r, r)
25
+C_O0_I3(r, r, r)
21
+C_O0_I3(o, m, r)
26
C_O0_I4(r, r, ri, ri)
22
C_O0_I4(r, r, ri, ri)
27
-C_O0_I4(S, S, S, S)
23
C_O0_I4(r, r, r, r)
28
-C_O1_I1(r, L)
29
+C_O0_I4(r, r, r, r)
30
C_O1_I1(r, r)
24
C_O1_I1(r, r)
31
C_O1_I1(v, r)
25
@@ -XXX,XX +XXX,XX @@ C_O1_I3(v, v, v, v)
32
C_O1_I1(v, v)
33
C_O1_I1(v, vr)
34
C_O1_I2(r, 0, rZ)
35
-C_O1_I2(r, L, L)
36
C_O1_I2(r, rI, ri)
37
C_O1_I2(r, rI, rT)
38
C_O1_I2(r, r, r)
39
@@ -XXX,XX +XXX,XX @@ C_O1_I2(v, v, v)
40
C_O1_I3(v, v, v, v)
41
C_O1_I4(r, r, ri, rZ, rZ)
26
C_O1_I4(r, r, ri, rZ, rZ)
42
C_O1_I4(r, r, r, ri, ri)
27
C_O1_I4(r, r, r, ri, ri)
43
-C_O2_I1(L, L, L)
28
C_O2_I1(r, r, r)
44
-C_O2_I2(L, L, L, L)
29
+C_O2_I1(o, m, r)
45
+C_O2_I1(r, r, r)
30
C_O2_I2(r, r, r, r)
46
+C_O2_I2(r, r, r, r)
47
C_O2_I4(r, r, rI, rZM, r, r)
31
C_O2_I4(r, r, rI, rZM, r, r)
48
C_O2_I4(r, r, r, r, rI, rZM)
32
C_O2_I4(r, r, r, r, rI, rZM)
49
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
33
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
50
index XXXXXXX..XXXXXXX 100644
34
index XXXXXXX..XXXXXXX 100644
51
--- a/tcg/ppc/tcg-target-con-str.h
35
--- a/tcg/ppc/tcg-target-con-str.h
52
+++ b/tcg/ppc/tcg-target-con-str.h
36
+++ b/tcg/ppc/tcg-target-con-str.h
53
@@ -XXX,XX +XXX,XX @@ REGS('A', 1u << TCG_REG_R3)
37
@@ -XXX,XX +XXX,XX @@
54
REGS('B', 1u << TCG_REG_R4)
38
* REGS(letter, register_mask)
55
REGS('C', 1u << TCG_REG_R5)
39
*/
56
REGS('D', 1u << TCG_REG_R6)
40
REGS('r', ALL_GENERAL_REGS)
57
-REGS('L', ALL_QLOAD_REGS)
41
+REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu) /* odd registers */
58
-REGS('S', ALL_QSTORE_REGS)
42
REGS('v', ALL_VECTOR_REGS)
59
43
60
/*
44
/*
61
* Define constraint letters for constants:
45
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
46
index XXXXXXX..XXXXXXX 100644
47
--- a/tcg/ppc/tcg-target.h
48
+++ b/tcg/ppc/tcg-target.h
49
@@ -XXX,XX +XXX,XX @@ extern bool have_vsx;
50
#define TCG_TARGET_HAS_mulsh_i64 1
51
#endif
52
53
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
54
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
55
+ (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
56
57
/*
58
* While technically Altivec could support V64, it has no 64-bit store
62
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
59
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
63
index XXXXXXX..XXXXXXX 100644
60
index XXXXXXX..XXXXXXX 100644
64
--- a/tcg/ppc/tcg-target.c.inc
61
--- a/tcg/ppc/tcg-target.c.inc
65
+++ b/tcg/ppc/tcg-target.c.inc
62
+++ b/tcg/ppc/tcg-target.c.inc
66
@@ -XXX,XX +XXX,XX @@
63
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
67
#define ALL_GENERAL_REGS 0xffffffffu
64
68
#define ALL_VECTOR_REGS 0xffffffff00000000ull
65
#define B OPCD( 18)
69
66
#define BC OPCD( 16)
70
-#ifdef CONFIG_SOFTMMU
67
+
71
-#define ALL_QLOAD_REGS \
68
#define LBZ OPCD( 34)
72
- (ALL_GENERAL_REGS & \
69
#define LHZ OPCD( 40)
73
- ~((1 << TCG_REG_R3) | (1 << TCG_REG_R4) | (1 << TCG_REG_R5)))
70
#define LHA OPCD( 42)
74
-#define ALL_QSTORE_REGS \
71
#define LWZ OPCD( 32)
75
- (ALL_GENERAL_REGS & ~((1 << TCG_REG_R3) | (1 << TCG_REG_R4) | \
72
#define LWZUX XO31( 55)
76
- (1 << TCG_REG_R5) | (1 << TCG_REG_R6)))
73
-#define STB OPCD( 38)
77
-#else
74
-#define STH OPCD( 44)
78
-#define ALL_QLOAD_REGS (ALL_GENERAL_REGS & ~(1 << TCG_REG_R3))
75
-#define STW OPCD( 36)
79
-#define ALL_QSTORE_REGS ALL_QLOAD_REGS
80
-#endif
81
-
76
-
82
TCGPowerISA have_isa;
77
-#define STD XO62( 0)
83
static bool have_isel;
78
-#define STDU XO62( 1)
84
bool have_altivec;
79
-#define STDX XO31(149)
80
-
81
#define LD XO58( 0)
82
#define LDX XO31( 21)
83
#define LDU XO58( 1)
84
#define LDUX XO31( 53)
85
#define LWA XO58( 2)
86
#define LWAX XO31(341)
87
+#define LQ OPCD( 56)
88
+
89
+#define STB OPCD( 38)
90
+#define STH OPCD( 44)
91
+#define STW OPCD( 36)
92
+#define STD XO62( 0)
93
+#define STDU XO62( 1)
94
+#define STDX XO31(149)
95
+#define STQ XO62( 2)
96
97
#define ADDIC OPCD( 12)
98
#define ADDI OPCD( 14)
99
@@ -XXX,XX +XXX,XX @@ typedef struct {
100
101
bool tcg_target_has_memory_bswap(MemOp memop)
102
{
103
- return true;
104
+ TCGAtomAlign aa;
105
+
106
+ if ((memop & MO_SIZE) <= MO_64) {
107
+ return true;
108
+ }
109
+
110
+ /*
111
+ * Reject 16-byte memop with 16-byte atomicity,
112
+ * but do allow a pair of 64-bit operations.
113
+ */
114
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
115
+ return aa.atom <= MO_64;
116
}
117
118
/*
119
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
120
{
121
TCGLabelQemuLdst *ldst = NULL;
122
MemOp opc = get_memop(oi);
123
- MemOp a_bits;
124
+ MemOp a_bits, s_bits;
125
126
/*
127
* Book II, Section 1.4, Single-Copy Atomicity, specifies:
128
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
129
* As of 3.0, "the non-atomic access is performed as described in
130
* the corresponding list", which matches MO_ATOM_SUBALIGN.
131
*/
132
+ s_bits = opc & MO_SIZE;
133
h->aa = atom_and_align_for_opc(s, opc,
134
have_isa_3_00 ? MO_ATOM_SUBALIGN
135
: MO_ATOM_IFALIGN,
136
- false);
137
+ s_bits == MO_128);
138
a_bits = h->aa.align;
139
140
#ifdef CONFIG_SOFTMMU
141
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
142
int fast_off = TLB_MASK_TABLE_OFS(mem_index);
143
int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
144
int table_off = fast_off + offsetof(CPUTLBDescFast, table);
145
- unsigned s_bits = opc & MO_SIZE;
146
147
ldst = new_ldst_label(s);
148
ldst->is_ld = is_ld;
149
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
150
}
151
}
152
153
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
154
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
155
+{
156
+ TCGLabelQemuLdst *ldst;
157
+ HostAddress h;
158
+ bool need_bswap;
159
+ uint32_t insn;
160
+ TCGReg index;
161
+
162
+ ldst = prepare_host_addr(s, &h, addr_reg, -1, oi, is_ld);
163
+
164
+ /* Compose the final address, as LQ/STQ have no indexing. */
165
+ index = h.index;
166
+ if (h.base != 0) {
167
+ index = TCG_REG_TMP1;
168
+ tcg_out32(s, ADD | TAB(index, h.base, h.index));
169
+ }
170
+ need_bswap = get_memop(oi) & MO_BSWAP;
171
+
172
+ if (h.aa.atom == MO_128) {
173
+ tcg_debug_assert(!need_bswap);
174
+ tcg_debug_assert(datalo & 1);
175
+ tcg_debug_assert(datahi == datalo - 1);
176
+ insn = is_ld ? LQ : STQ;
177
+ tcg_out32(s, insn | TAI(datahi, index, 0));
178
+ } else {
179
+ TCGReg d1, d2;
180
+
181
+ if (HOST_BIG_ENDIAN ^ need_bswap) {
182
+ d1 = datahi, d2 = datalo;
183
+ } else {
184
+ d1 = datalo, d2 = datahi;
185
+ }
186
+
187
+ if (need_bswap) {
188
+ tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
189
+ insn = is_ld ? LDBRX : STDBRX;
190
+ tcg_out32(s, insn | TAB(d1, 0, index));
191
+ tcg_out32(s, insn | TAB(d2, index, TCG_REG_R0));
192
+ } else {
193
+ insn = is_ld ? LD : STD;
194
+ tcg_out32(s, insn | TAI(d1, index, 0));
195
+ tcg_out32(s, insn | TAI(d2, index, 8));
196
+ }
197
+ }
198
+
199
+ if (ldst) {
200
+ ldst->type = TCG_TYPE_I128;
201
+ ldst->datalo_reg = datalo;
202
+ ldst->datahi_reg = datahi;
203
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
204
+ }
205
+}
206
+
207
static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
208
{
209
int i;
210
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
211
args[4], TCG_TYPE_I64);
212
}
213
break;
214
+ case INDEX_op_qemu_ld_a32_i128:
215
+ case INDEX_op_qemu_ld_a64_i128:
216
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
217
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
218
+ break;
219
220
case INDEX_op_qemu_st_a64_i32:
221
if (TCG_TARGET_REG_BITS == 32) {
222
@@ -XXX,XX +XXX,XX @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
223
args[4], TCG_TYPE_I64);
224
}
225
break;
226
+ case INDEX_op_qemu_st_a32_i128:
227
+ case INDEX_op_qemu_st_a64_i128:
228
+ tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
229
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
230
+ break;
231
232
case INDEX_op_setcond_i32:
233
tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
85
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
234
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
86
235
case INDEX_op_qemu_st_a64_i64:
87
case INDEX_op_qemu_ld_i32:
236
return TCG_TARGET_REG_BITS == 64 ? C_O0_I2(r, r) : C_O0_I4(r, r, r, r);
88
return (TCG_TARGET_REG_BITS == 64 || TARGET_LONG_BITS == 32
237
89
- ? C_O1_I1(r, L)
238
+ case INDEX_op_qemu_ld_a32_i128:
90
- : C_O1_I2(r, L, L));
239
+ case INDEX_op_qemu_ld_a64_i128:
91
+ ? C_O1_I1(r, r)
240
+ return C_O2_I1(o, m, r);
92
+ : C_O1_I2(r, r, r));
241
+ case INDEX_op_qemu_st_a32_i128:
93
242
+ case INDEX_op_qemu_st_a64_i128:
94
case INDEX_op_qemu_st_i32:
243
+ return C_O0_I3(o, m, r);
95
return (TCG_TARGET_REG_BITS == 64 || TARGET_LONG_BITS == 32
244
+
96
- ? C_O0_I2(S, S)
97
- : C_O0_I3(S, S, S));
98
+ ? C_O0_I2(r, r)
99
+ : C_O0_I3(r, r, r));
100
101
case INDEX_op_qemu_ld_i64:
102
- return (TCG_TARGET_REG_BITS == 64 ? C_O1_I1(r, L)
103
- : TARGET_LONG_BITS == 32 ? C_O2_I1(L, L, L)
104
- : C_O2_I2(L, L, L, L));
105
+ return (TCG_TARGET_REG_BITS == 64 ? C_O1_I1(r, r)
106
+ : TARGET_LONG_BITS == 32 ? C_O2_I1(r, r, r)
107
+ : C_O2_I2(r, r, r, r));
108
109
case INDEX_op_qemu_st_i64:
110
- return (TCG_TARGET_REG_BITS == 64 ? C_O0_I2(S, S)
111
- : TARGET_LONG_BITS == 32 ? C_O0_I3(S, S, S)
112
- : C_O0_I4(S, S, S, S));
113
+ return (TCG_TARGET_REG_BITS == 64 ? C_O0_I2(r, r)
114
+ : TARGET_LONG_BITS == 32 ? C_O0_I3(r, r, r)
115
+ : C_O0_I4(r, r, r, r));
116
117
case INDEX_op_add_vec:
245
case INDEX_op_add_vec:
118
case INDEX_op_sub_vec:
246
case INDEX_op_sub_vec:
247
case INDEX_op_mul_vec:
119
--
2.34.1

Adjust the softmmu tlb to use R0+R1, not any of the normally available
registers. Since we handle overlap between inputs and helper arguments,
we can allow any allocatable reg.

Use LPQ/STPQ when 16-byte atomicity is required.
Note that these instructions require 16-byte alignment.
4
3
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
4
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
6
---
8
tcg/s390x/tcg-target-con-set.h | 2 --
7
tcg/s390x/tcg-target-con-set.h | 2 +
9
tcg/s390x/tcg-target-con-str.h | 1 -
8
tcg/s390x/tcg-target.h | 2 +-
10
tcg/s390x/tcg-target.c.inc | 36 ++++++++++++----------------------
9
tcg/s390x/tcg-target.c.inc | 107 ++++++++++++++++++++++++++++++++-
11
3 files changed, 12 insertions(+), 27 deletions(-)
10
3 files changed, 107 insertions(+), 4 deletions(-)
12
11
13
diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
12
diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
14
index XXXXXXX..XXXXXXX 100644
13
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/s390x/tcg-target-con-set.h
14
--- a/tcg/s390x/tcg-target-con-set.h
16
+++ b/tcg/s390x/tcg-target-con-set.h
15
+++ b/tcg/s390x/tcg-target-con-set.h
17
@@ -XXX,XX +XXX,XX @@
16
@@ -XXX,XX +XXX,XX @@ C_O0_I2(r, r)
18
* tcg-target-con-str.h; the constraint combination is inclusive or.
19
*/
20
C_O0_I1(r)
21
-C_O0_I2(L, L)
22
C_O0_I2(r, r)
23
C_O0_I2(r, ri)
17
C_O0_I2(r, ri)
24
C_O0_I2(r, rA)
18
C_O0_I2(r, rA)
25
C_O0_I2(v, r)
19
C_O0_I2(v, r)
26
-C_O1_I1(r, L)
20
+C_O0_I3(o, m, r)
27
C_O1_I1(r, r)
21
C_O1_I1(r, r)
28
C_O1_I1(v, r)
22
C_O1_I1(v, r)
29
C_O1_I1(v, v)
23
C_O1_I1(v, v)
30
diff --git a/tcg/s390x/tcg-target-con-str.h b/tcg/s390x/tcg-target-con-str.h
24
@@ -XXX,XX +XXX,XX @@ C_O1_I2(v, v, v)
25
C_O1_I3(v, v, v, v)
26
C_O1_I4(r, r, ri, rI, r)
27
C_O1_I4(r, r, rA, rI, r)
28
+C_O2_I1(o, m, r)
29
C_O2_I2(o, m, 0, r)
30
C_O2_I2(o, m, r, r)
31
C_O2_I3(o, m, 0, 1, r)
32
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
31
index XXXXXXX..XXXXXXX 100644
33
index XXXXXXX..XXXXXXX 100644
32
--- a/tcg/s390x/tcg-target-con-str.h
34
--- a/tcg/s390x/tcg-target.h
33
+++ b/tcg/s390x/tcg-target-con-str.h
35
+++ b/tcg/s390x/tcg-target.h
34
@@ -XXX,XX +XXX,XX @@
36
@@ -XXX,XX +XXX,XX @@ extern uint64_t s390_facilities[3];
35
* REGS(letter, register_mask)
37
#define TCG_TARGET_HAS_muluh_i64 0
36
*/
38
#define TCG_TARGET_HAS_mulsh_i64 0
37
REGS('r', ALL_GENERAL_REGS)
39
38
-REGS('L', ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
40
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
39
REGS('v', ALL_VECTOR_REGS)
41
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
40
REGS('o', 0xaaaa) /* odd numbered general regs */
42
41
43
#define TCG_TARGET_HAS_v64 HAVE_FACILITY(VECTOR)
44
#define TCG_TARGET_HAS_v128 HAVE_FACILITY(VECTOR)
42
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
45
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
43
index XXXXXXX..XXXXXXX 100644
46
index XXXXXXX..XXXXXXX 100644
44
--- a/tcg/s390x/tcg-target.c.inc
47
--- a/tcg/s390x/tcg-target.c.inc
45
+++ b/tcg/s390x/tcg-target.c.inc
48
+++ b/tcg/s390x/tcg-target.c.inc
46
@@ -XXX,XX +XXX,XX @@
49
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
47
#define ALL_GENERAL_REGS MAKE_64BIT_MASK(0, 16)
50
RXY_LLGF = 0xe316,
48
#define ALL_VECTOR_REGS MAKE_64BIT_MASK(32, 32)
51
RXY_LLGH = 0xe391,
49
52
RXY_LMG = 0xeb04,
50
-/*
53
+ RXY_LPQ = 0xe38f,
51
- * For softmmu, we need to avoid conflicts with the first 3
54
RXY_LRV = 0xe31e,
52
- * argument registers to perform the tlb lookup, and to call
55
RXY_LRVG = 0xe30f,
53
- * the helper function.
56
RXY_LRVH = 0xe31f,
54
- */
57
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
55
-#ifdef CONFIG_SOFTMMU
58
RXY_STG = 0xe324,
56
-#define SOFTMMU_RESERVE_REGS MAKE_64BIT_MASK(TCG_REG_R2, 3)
59
RXY_STHY = 0xe370,
57
-#else
60
RXY_STMG = 0xeb24,
58
-#define SOFTMMU_RESERVE_REGS 0
61
+ RXY_STPQ = 0xe38e,
59
-#endif
62
RXY_STRV = 0xe33e,
60
-
63
RXY_STRVG = 0xe32f,
61
-
64
RXY_STRVH = 0xe33f,
62
/* Several places within the instruction set 0 means "no register"
65
@@ -XXX,XX +XXX,XX @@ typedef struct {
63
rather than TCG_REG_R0. */
66
64
#define TCG_REG_NONE 0
67
bool tcg_target_has_memory_bswap(MemOp memop)
68
{
69
- return true;
70
+ TCGAtomAlign aa;
71
+
72
+ if ((memop & MO_SIZE) <= MO_64) {
73
+ return true;
74
+ }
75
+
76
+ /*
77
+ * Reject 16-byte memop with 16-byte atomicity,
78
+ * but do allow a pair of 64-bit operations.
79
+ */
80
+ aa = atom_and_align_for_opc(tcg_ctx, memop, MO_ATOM_IFALIGN, true);
81
+ return aa.atom <= MO_64;
82
}
83
84
static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
65
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
85
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
66
ldst->oi = oi;
86
{
67
ldst->addrlo_reg = addr_reg;
87
TCGLabelQemuLdst *ldst = NULL;
68
88
MemOp opc = get_memop(oi);
69
- tcg_out_sh64(s, RSY_SRLG, TCG_REG_R2, addr_reg, TCG_REG_NONE,
89
+ MemOp s_bits = opc & MO_SIZE;
70
+ tcg_out_sh64(s, RSY_SRLG, TCG_TMP0, addr_reg, TCG_REG_NONE,
90
unsigned a_mask;
71
TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
91
72
92
- h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, false);
73
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
93
+ h->aa = atom_and_align_for_opc(s, opc, MO_ATOM_IFALIGN, s_bits == MO_128);
74
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 19));
94
a_mask = (1 << h->aa.align) - 1;
75
- tcg_out_insn(s, RXY, NG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, mask_off);
95
76
- tcg_out_insn(s, RXY, AG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, table_off);
96
#ifdef CONFIG_SOFTMMU
77
+ tcg_out_insn(s, RXY, NG, TCG_TMP0, TCG_AREG0, TCG_REG_NONE, mask_off);
97
- unsigned s_bits = opc & MO_SIZE;
78
+ tcg_out_insn(s, RXY, AG, TCG_TMP0, TCG_AREG0, TCG_REG_NONE, table_off);
98
unsigned s_mask = (1 << s_bits) - 1;
79
99
int mem_index = get_mmuidx(oi);
80
/*
100
int fast_off = TLB_MASK_TABLE_OFS(mem_index);
81
* For aligned accesses, we check the first byte and include the alignment
101
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
82
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
83
a_off = (a_bits >= s_bits ? 0 : s_mask - a_mask);
84
tlb_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
85
if (a_off == 0) {
86
- tgen_andi_risbg(s, TCG_REG_R3, addr_reg, tlb_mask);
87
+ tgen_andi_risbg(s, TCG_REG_R0, addr_reg, tlb_mask);
88
} else {
89
- tcg_out_insn(s, RX, LA, TCG_REG_R3, addr_reg, TCG_REG_NONE, a_off);
90
- tgen_andi(s, TCG_TYPE_TL, TCG_REG_R3, tlb_mask);
91
+ tcg_out_insn(s, RX, LA, TCG_REG_R0, addr_reg, TCG_REG_NONE, a_off);
92
+ tgen_andi(s, TCG_TYPE_TL, TCG_REG_R0, tlb_mask);
93
}
102
}
94
103
}
95
if (is_ld) {
104
96
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
105
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
97
ofs = offsetof(CPUTLBEntry, addr_write);
106
+ TCGReg addr_reg, MemOpIdx oi, bool is_ld)
98
}
107
+{
99
if (TARGET_LONG_BITS == 32) {
108
+ TCGLabel *l1 = NULL, *l2 = NULL;
100
- tcg_out_insn(s, RX, C, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
109
+ TCGLabelQemuLdst *ldst;
101
+ tcg_out_insn(s, RX, C, TCG_REG_R0, TCG_TMP0, TCG_REG_NONE, ofs);
110
+ HostAddress h;
102
} else {
111
+ bool need_bswap;
103
- tcg_out_insn(s, RXY, CG, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
112
+ bool use_pair;
104
+ tcg_out_insn(s, RXY, CG, TCG_REG_R0, TCG_TMP0, TCG_REG_NONE, ofs);
113
+ S390Opcode insn;
105
}
114
+
106
115
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
107
tcg_out16(s, RI_BRC | (S390_CC_NE << 4));
116
+
108
ldst->label_ptr[0] = s->code_ptr++;
117
+ use_pair = h.aa.atom < MO_128;
109
118
+ need_bswap = get_memop(oi) & MO_BSWAP;
110
- h->index = TCG_REG_R2;
119
+
111
- tcg_out_insn(s, RXY, LG, h->index, TCG_REG_R2, TCG_REG_NONE,
120
+ if (!use_pair) {
112
+ h->index = TCG_TMP0;
121
+ /*
113
+ tcg_out_insn(s, RXY, LG, h->index, TCG_TMP0, TCG_REG_NONE,
122
+ * Atomicity requires we use LPQ. If we've already checked for
114
offsetof(CPUTLBEntry, addend));
123
+ * 16-byte alignment, that's all we need. If we arrive with
115
124
+ * lesser alignment, we have determined that less than 16-byte
116
if (TARGET_LONG_BITS == 32) {
125
+ * alignment can be satisfied with two 8-byte loads.
126
+ */
127
+ if (h.aa.align < MO_128) {
128
+ use_pair = true;
129
+ l1 = gen_new_label();
130
+ l2 = gen_new_label();
131
+
132
+ tcg_out_insn(s, RI, TMLL, addr_reg, 15);
133
+ tgen_branch(s, 7, l1); /* CC in {1,2,3} */
134
+ }
135
+
136
+ tcg_debug_assert(!need_bswap);
137
+ tcg_debug_assert(datalo & 1);
138
+ tcg_debug_assert(datahi == datalo - 1);
139
+ insn = is_ld ? RXY_LPQ : RXY_STPQ;
140
+ tcg_out_insn_RXY(s, insn, datahi, h.base, h.index, h.disp);
141
+
142
+ if (use_pair) {
143
+ tgen_branch(s, S390_CC_ALWAYS, l2);
144
+ tcg_out_label(s, l1);
145
+ }
146
+ }
147
+ if (use_pair) {
148
+ TCGReg d1, d2;
149
+
150
+ if (need_bswap) {
151
+ d1 = datalo, d2 = datahi;
152
+ insn = is_ld ? RXY_LRVG : RXY_STRVG;
153
+ } else {
154
+ d1 = datahi, d2 = datalo;
155
+ insn = is_ld ? RXY_LG : RXY_STG;
156
+ }
157
+
158
+ if (h.base == d1 || h.index == d1) {
159
+ tcg_out_insn(s, RXY, LAY, TCG_TMP0, h.base, h.index, h.disp);
160
+ h.base = TCG_TMP0;
161
+ h.index = TCG_REG_NONE;
162
+ h.disp = 0;
163
+ }
164
+ tcg_out_insn_RXY(s, insn, d1, h.base, h.index, h.disp);
165
+ tcg_out_insn_RXY(s, insn, d2, h.base, h.index, h.disp + 8);
166
+ }
167
+ if (l2) {
168
+ tcg_out_label(s, l2);
169
+ }
170
+
171
+ if (ldst) {
172
+ ldst->type = TCG_TYPE_I128;
173
+ ldst->datalo_reg = datalo;
174
+ ldst->datahi_reg = datahi;
175
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
176
+ }
177
+}
178
+
179
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
180
{
181
/* Reuse the zeroing that exists for goto_ptr. */
182
@@ -XXX,XX +XXX,XX @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
183
case INDEX_op_qemu_st_a64_i64:
184
tcg_out_qemu_st(s, args[0], args[1], args[2], TCG_TYPE_I64);
185
break;
186
+ case INDEX_op_qemu_ld_a32_i128:
187
+ case INDEX_op_qemu_ld_a64_i128:
188
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
189
+ break;
190
+ case INDEX_op_qemu_st_a32_i128:
191
+ case INDEX_op_qemu_st_a64_i128:
192
+ tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
193
+ break;
194
195
case INDEX_op_ld16s_i64:
196
tcg_out_mem(s, 0, RXY_LGH, args[0], args[1], TCG_REG_NONE, args[2]);
117
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
197
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
118
198
case INDEX_op_qemu_st_a32_i32:
119
case INDEX_op_qemu_ld_i32:
199
case INDEX_op_qemu_st_a64_i32:
120
case INDEX_op_qemu_ld_i64:
200
return C_O0_I2(r, r);
121
- return C_O1_I1(r, L);
201
+ case INDEX_op_qemu_ld_a32_i128:
122
+ return C_O1_I1(r, r);
202
+ case INDEX_op_qemu_ld_a64_i128:
123
case INDEX_op_qemu_st_i64:
203
+ return C_O2_I1(o, m, r);
124
case INDEX_op_qemu_st_i32:
204
+ case INDEX_op_qemu_st_a32_i128:
125
- return C_O0_I2(L, L);
205
+ case INDEX_op_qemu_st_a64_i128:
126
+ return C_O0_I2(r, r);
206
+ return C_O0_I3(o, m, r);
127
207
128
case INDEX_op_deposit_i32:
208
case INDEX_op_deposit_i32:
129
case INDEX_op_deposit_i64:
209
case INDEX_op_deposit_i64:
130
--
2.34.1

Instead of trying to unify all operations on uint64_t, pull out
mmu_lookup() to perform the basic tlb hit and resolution.
Create individual functions to handle access by size.
4
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
3
---
9
accel/tcg/cputlb.c | 645 +++++++++++++++++++++++++++++----------------
4
.../generic/host/load-extract-al16-al8.h | 45 +++++++++++++++++++
10
1 file changed, 424 insertions(+), 221 deletions(-)
5
accel/tcg/ldst_atomicity.c.inc | 36 +--------------
6
2 files changed, 47 insertions(+), 34 deletions(-)
7
create mode 100644 host/include/generic/host/load-extract-al16-al8.h
11
8
12
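
In the reorganized helpers added below, a load that crosses a page
boundary is assembled by folding each page's bytes into one big-endian
accumulator (do_ld_mmio_beN/do_ld_bytes_beN) and byte-swapping the result
at the end for little-endian operations. A small stand-alone illustration
of that accumulation; accum_be and the page*_bytes names are invented for
the example:

    #include <stdint.h>
    #include <stddef.h>

    /* Fold n bytes into acc in big-endian order, the same loop that
     * do_ld_bytes_beN() runs over the RAM half of a cross-page access. */
    static uint64_t accum_be(uint64_t acc, const uint8_t *bytes, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            acc = (acc << 8) | bytes[i];
        }
        return acc;
    }

    /* A 4-byte load split 1+3 across a page boundary:
     *
     *   uint64_t ret = accum_be(0, page0_bytes, 1);
     *   ret = accum_be(ret, page1_bytes, 3);
     *   if (the MemOp is little-endian) {
     *       ret = __builtin_bswap32((uint32_t)ret);
     *   }
     */
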
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
9
diff --git a/host/include/generic/host/load-extract-al16-al8.h b/host/include/generic/host/load-extract-al16-al8.h
13
index XXXXXXX..XXXXXXX 100644
10
new file mode 100644
14
--- a/accel/tcg/cputlb.c
11
index XXXXXXX..XXXXXXX
15
+++ b/accel/tcg/cputlb.c
12
--- /dev/null
16
@@ -XXX,XX +XXX,XX @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, int mmu_idx,
13
+++ b/host/include/generic/host/load-extract-al16-al8.h
17
14
@@ -XXX,XX +XXX,XX @@
18
#endif
19
20
+/*
15
+/*
21
+ * Probe for a load/store operation.
16
+ * SPDX-License-Identifier: GPL-2.0-or-later
22
+ * Return the host address and into @flags.
17
+ * Atomic extract 64 from 128-bit, generic version.
18
+ *
19
+ * Copyright (C) 2023 Linaro, Ltd.
23
+ */
20
+ */
24
+
21
+
25
+typedef struct MMULookupPageData {
22
+#ifndef HOST_LOAD_EXTRACT_AL16_AL8_H
26
+ CPUTLBEntryFull *full;
23
+#define HOST_LOAD_EXTRACT_AL16_AL8_H
27
+ void *haddr;
28
+ target_ulong addr;
29
+ int flags;
30
+ int size;
31
+} MMULookupPageData;
32
+
33
+typedef struct MMULookupLocals {
34
+ MMULookupPageData page[2];
35
+ MemOp memop;
36
+ int mmu_idx;
37
+} MMULookupLocals;
38
+
24
+
39
+/**
25
+/**
40
+ * mmu_lookup1: translate one page
26
+ * load_atom_extract_al16_or_al8:
41
+ * @env: cpu context
27
+ * @pv: host address
42
+ * @data: lookup parameters
28
+ * @s: object size in bytes, @s <= 8.
43
+ * @mmu_idx: virtual address context
44
+ * @access_type: load/store/code
45
+ * @ra: return address into tcg generated code, or 0
46
+ *
29
+ *
47
+ * Resolve the translation for the one page at @data.addr, filling in
30
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
48
+ * the rest of @data with the results. If the translation fails,
31
+ * cross an 16-byte boundary then the access must be 16-byte atomic,
49
+ * tlb_fill will longjmp out. Return true if the softmmu tlb for
32
+ * otherwise the access must be 8-byte atomic.
50
+ * @mmu_idx may have resized.
51
+ */
33
+ */
52
+static bool mmu_lookup1(CPUArchState *env, MMULookupPageData *data,
34
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
53
+ int mmu_idx, MMUAccessType access_type, uintptr_t ra)
35
+load_atom_extract_al16_or_al8(void *pv, int s)
54
+{
36
+{
55
+ target_ulong addr = data->addr;
37
+ uintptr_t pi = (uintptr_t)pv;
56
+ uintptr_t index = tlb_index(env, mmu_idx, addr);
38
+ int o = pi & 7;
57
+ CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
39
+ int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
58
+ target_ulong tlb_addr = tlb_read_idx(entry, access_type);
40
+ Int128 r;
59
+ bool maybe_resized = false;
60
+
41
+
61
+ /* If the TLB entry is for a different page, reload and try again. */
42
+ pv = (void *)(pi & ~7);
62
+ if (!tlb_hit(tlb_addr, addr)) {
43
+ if (pi & 8) {
63
+ if (!victim_tlb_hit(env, mmu_idx, index, access_type,
44
+ uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
64
+ addr & TARGET_PAGE_MASK)) {
45
+ uint64_t a = qatomic_read__nocheck(p8);
65
+ tlb_fill(env_cpu(env), addr, data->size, access_type, mmu_idx, ra);
46
+ uint64_t b = qatomic_read__nocheck(p8 + 1);
66
+ maybe_resized = true;
47
+
67
+ index = tlb_index(env, mmu_idx, addr);
48
+ if (HOST_BIG_ENDIAN) {
68
+ entry = tlb_entry(env, mmu_idx, addr);
49
+ r = int128_make128(b, a);
50
+ } else {
51
+ r = int128_make128(a, b);
69
+ }
52
+ }
70
+ tlb_addr = tlb_read_idx(entry, access_type) & ~TLB_INVALID_MASK;
53
+ } else {
54
+ r = atomic16_read_ro(pv);
71
+ }
55
+ }
72
+
56
+ return int128_getlo(int128_urshift(r, shr));
73
+ data->flags = tlb_addr & TLB_FLAGS_MASK;
74
+ data->full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
75
+ /* Compute haddr speculatively; depending on flags it might be invalid. */
76
+ data->haddr = (void *)((uintptr_t)addr + entry->addend);
77
+
78
+ return maybe_resized;
79
+}
57
+}
80
+
58
+
81
+/**
59
+#endif /* HOST_LOAD_EXTRACT_AL16_AL8_H */
82
+ * mmu_watch_or_dirty
60
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
83
+ * @env: cpu context
61
index XXXXXXX..XXXXXXX 100644
84
+ * @data: lookup parameters
62
--- a/accel/tcg/ldst_atomicity.c.inc
85
+ * @access_type: load/store/code
63
+++ b/accel/tcg/ldst_atomicity.c.inc
86
+ * @ra: return address into tcg generated code, or 0
64
@@ -XXX,XX +XXX,XX @@
87
+ *
65
* See the COPYING file in the top-level directory.
88
+ * Trigger watchpoints for @data.addr:@data.size;
66
*/
89
+ * record writes to protected clean pages.
67
90
+ */
68
+#include "host/load-extract-al16-al8.h"
91
+static void mmu_watch_or_dirty(CPUArchState *env, MMULookupPageData *data,
92
+ MMUAccessType access_type, uintptr_t ra)
93
+{
94
+ CPUTLBEntryFull *full = data->full;
95
+ target_ulong addr = data->addr;
96
+ int flags = data->flags;
97
+ int size = data->size;
98
+
69
+
99
+ /* On watchpoint hit, this will longjmp out. */
70
#ifdef CONFIG_ATOMIC64
100
+ if (flags & TLB_WATCHPOINT) {
71
# define HAVE_al8 true
101
+ int wp = access_type == MMU_DATA_STORE ? BP_MEM_WRITE : BP_MEM_READ;
72
#else
102
+ cpu_check_watchpoint(env_cpu(env), addr, size, full->attrs, wp, ra);
73
@@ -XXX,XX +XXX,XX @@ static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
103
+ flags &= ~TLB_WATCHPOINT;
74
return int128_getlo(r);
104
+ }
105
+
106
+ /* Note that notdirty is only set for writes. */
107
+ if (flags & TLB_NOTDIRTY) {
108
+ notdirty_write(env_cpu(env), addr, size, full, ra);
109
+ flags &= ~TLB_NOTDIRTY;
110
+ }
111
+ data->flags = flags;
112
+}
113
+
114
+/**
115
+ * mmu_lookup: translate page(s)
116
+ * @env: cpu context
117
+ * @addr: virtual address
118
+ * @oi: combined mmu_idx and MemOp
119
+ * @ra: return address into tcg generated code, or 0
120
+ * @access_type: load/store/code
121
+ * @l: output result
122
+ *
123
+ * Resolve the translation for the page(s) beginning at @addr, for MemOp.size
124
+ * bytes. Return true if the lookup crosses a page boundary.
125
+ */
126
+static bool mmu_lookup(CPUArchState *env, target_ulong addr, MemOpIdx oi,
127
+ uintptr_t ra, MMUAccessType type, MMULookupLocals *l)
128
+{
129
+ unsigned a_bits;
130
+ bool crosspage;
131
+ int flags;
132
+
133
+ l->memop = get_memop(oi);
134
+ l->mmu_idx = get_mmuidx(oi);
135
+
136
+ tcg_debug_assert(l->mmu_idx < NB_MMU_MODES);
137
+
138
+ /* Handle CPU specific unaligned behaviour */
139
+ a_bits = get_alignment_bits(l->memop);
140
+ if (addr & ((1 << a_bits) - 1)) {
141
+ cpu_unaligned_access(env_cpu(env), addr, type, l->mmu_idx, ra);
142
+ }
143
+
144
+ l->page[0].addr = addr;
145
+ l->page[0].size = memop_size(l->memop);
146
+ l->page[1].addr = (addr + l->page[0].size - 1) & TARGET_PAGE_MASK;
147
+ l->page[1].size = 0;
148
+ crosspage = (addr ^ l->page[1].addr) & TARGET_PAGE_MASK;
149
+
150
+ if (likely(!crosspage)) {
151
+ mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
152
+
153
+ flags = l->page[0].flags;
154
+ if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
155
+ mmu_watch_or_dirty(env, &l->page[0], type, ra);
156
+ }
157
+ if (unlikely(flags & TLB_BSWAP)) {
158
+ l->memop ^= MO_BSWAP;
159
+ }
160
+ } else {
161
+ /* Finish compute of page crossing. */
162
+ int size0 = l->page[1].addr - addr;
163
+ l->page[1].size = l->page[0].size - size0;
164
+ l->page[0].size = size0;
165
+
166
+ /*
167
+ * Lookup both pages, recognizing exceptions from either. If the
168
+ * second lookup potentially resized, refresh first CPUTLBEntryFull.
169
+ */
170
+ mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
171
+ if (mmu_lookup1(env, &l->page[1], l->mmu_idx, type, ra)) {
172
+ uintptr_t index = tlb_index(env, l->mmu_idx, addr);
173
+ l->page[0].full = &env_tlb(env)->d[l->mmu_idx].fulltlb[index];
174
+ }
175
+
176
+ flags = l->page[0].flags | l->page[1].flags;
177
+ if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
178
+ mmu_watch_or_dirty(env, &l->page[0], type, ra);
179
+ mmu_watch_or_dirty(env, &l->page[1], type, ra);
180
+ }
181
+
182
+ /*
183
+ * Since target/sparc is the only user of TLB_BSWAP, and all
184
+ * Sparc accesses are aligned, any treatment across two pages
185
+ * would be arbitrary. Refuse it until there's a use.
186
+ */
187
+ tcg_debug_assert((flags & TLB_BSWAP) == 0);
188
+ }
189
+
190
+ return crosspage;
191
+}
192
+
193
/*
194
* Probe for an atomic operation. Do not allow unaligned operations,
195
* or io operations to proceed. Return the host address.
196
@@ -XXX,XX +XXX,XX @@ load_memop(const void *haddr, MemOp op)
197
}
198
}
75
}
199
76
200
-static inline uint64_t QEMU_ALWAYS_INLINE
77
-/**
201
-load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
78
- * load_atom_extract_al16_or_al8:
202
- uintptr_t retaddr, MemOp op, MMUAccessType access_type,
79
- * @p: host address
203
- FullLoadHelper *full_load)
80
- * @s: object size in bytes, @s <= 8.
81
- *
82
- * Load @s bytes from @p, when p % s != 0. If [p, p+s-1] does not
83
- * cross an 16-byte boundary then the access must be 16-byte atomic,
84
- * otherwise the access must be 8-byte atomic.
85
- */
86
-static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
87
-load_atom_extract_al16_or_al8(void *pv, int s)
204
-{
88
-{
205
- const unsigned a_bits = get_alignment_bits(get_memop(oi));
89
- uintptr_t pi = (uintptr_t)pv;
206
- const size_t size = memop_size(op);
90
- int o = pi & 7;
207
- uintptr_t mmu_idx = get_mmuidx(oi);
91
- int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
208
- uintptr_t index;
92
- Int128 r;
209
- CPUTLBEntry *entry;
210
- target_ulong tlb_addr;
211
- void *haddr;
212
- uint64_t res;
213
-
93
-
214
- tcg_debug_assert(mmu_idx < NB_MMU_MODES);
94
- pv = (void *)(pi & ~7);
95
- if (pi & 8) {
96
- uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
97
- uint64_t a = qatomic_read__nocheck(p8);
98
- uint64_t b = qatomic_read__nocheck(p8 + 1);
215
-
99
-
216
- /* Handle CPU specific unaligned behaviour */
100
- if (HOST_BIG_ENDIAN) {
217
- if (addr & ((1 << a_bits) - 1)) {
101
- r = int128_make128(b, a);
218
- cpu_unaligned_access(env_cpu(env), addr, access_type,
102
- } else {
219
- mmu_idx, retaddr);
103
- r = int128_make128(a, b);
104
- }
105
- } else {
106
- r = atomic16_read_ro(pv);
220
- }
107
- }
221
-
108
- return int128_getlo(int128_urshift(r, shr));
222
- index = tlb_index(env, mmu_idx, addr);
223
- entry = tlb_entry(env, mmu_idx, addr);
224
- tlb_addr = tlb_read_idx(entry, access_type);
225
-
226
- /* If the TLB entry is for a different page, reload and try again. */
227
- if (!tlb_hit(tlb_addr, addr)) {
228
- if (!victim_tlb_hit(env, mmu_idx, index, access_type,
229
- addr & TARGET_PAGE_MASK)) {
230
- tlb_fill(env_cpu(env), addr, size,
231
- access_type, mmu_idx, retaddr);
232
- index = tlb_index(env, mmu_idx, addr);
233
- entry = tlb_entry(env, mmu_idx, addr);
234
- }
235
- tlb_addr = tlb_read_idx(entry, access_type);
236
- tlb_addr &= ~TLB_INVALID_MASK;
237
- }
238
-
239
- /* Handle anything that isn't just a straight memory access. */
240
- if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
241
- CPUTLBEntryFull *full;
242
- bool need_swap;
243
-
244
- /* For anything that is unaligned, recurse through full_load. */
245
- if ((addr & (size - 1)) != 0) {
246
- goto do_unaligned_access;
247
- }
248
-
249
- full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
250
-
251
- /* Handle watchpoints. */
252
- if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
253
- /* On watchpoint hit, this will longjmp out. */
254
- cpu_check_watchpoint(env_cpu(env), addr, size,
255
- full->attrs, BP_MEM_READ, retaddr);
256
- }
257
-
258
- need_swap = size > 1 && (tlb_addr & TLB_BSWAP);
259
-
260
- /* Handle I/O access. */
261
- if (likely(tlb_addr & TLB_MMIO)) {
262
- return io_readx(env, full, mmu_idx, addr, retaddr,
263
- access_type, op ^ (need_swap * MO_BSWAP));
264
- }
265
-
266
- haddr = (void *)((uintptr_t)addr + entry->addend);
267
-
268
- /*
269
- * Keep these two load_memop separate to ensure that the compiler
270
- * is able to fold the entire function to a single instruction.
271
- * There is a build-time assert inside to remind you of this. ;-)
272
- */
273
- if (unlikely(need_swap)) {
274
- return load_memop(haddr, op ^ MO_BSWAP);
275
- }
276
- return load_memop(haddr, op);
277
- }
278
-
279
- /* Handle slow unaligned access (it spans two pages or IO). */
280
- if (size > 1
281
- && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
282
- >= TARGET_PAGE_SIZE)) {
283
- target_ulong addr1, addr2;
284
- uint64_t r1, r2;
285
- unsigned shift;
286
- do_unaligned_access:
287
- addr1 = addr & ~((target_ulong)size - 1);
288
- addr2 = addr1 + size;
289
- r1 = full_load(env, addr1, oi, retaddr);
290
- r2 = full_load(env, addr2, oi, retaddr);
291
- shift = (addr & (size - 1)) * 8;
292
-
293
- if (memop_big_endian(op)) {
294
- /* Big-endian combine. */
295
- res = (r1 << shift) | (r2 >> ((size * 8) - shift));
296
- } else {
297
- /* Little-endian combine. */
298
- res = (r1 >> shift) | (r2 << ((size * 8) - shift));
299
- }
300
- return res & MAKE_64BIT_MASK(0, size * 8);
301
- }
302
-
303
- haddr = (void *)((uintptr_t)addr + entry->addend);
304
- return load_memop(haddr, op);
305
-}
109
-}
306
-
110
-
307
/*
111
/**
308
* For the benefit of TCG generated code, we want to avoid the
112
* load_atom_4_by_2:
309
* complication of ABI-specific return type promotion and always
113
* @pv: host address
310
@@ -XXX,XX +XXX,XX @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
311
* We don't bother with this widened value for SOFTMMU_CODE_ACCESS.
312
*/
313
314
-static uint64_t full_ldub_mmu(CPUArchState *env, target_ulong addr,
315
- MemOpIdx oi, uintptr_t retaddr)
316
+/**
317
+ * do_ld_mmio_beN:
318
+ * @env: cpu context
319
+ * @p: translation parameters
320
+ * @ret_be: accumulated data
321
+ * @mmu_idx: virtual address context
322
+ * @ra: return address into tcg generated code, or 0
323
+ *
324
+ * Load @p->size bytes from @p->addr, which is memory-mapped i/o.
325
+ * The bytes are concatenated in big-endian order with @ret_be.
326
+ */
327
+static uint64_t do_ld_mmio_beN(CPUArchState *env, MMULookupPageData *p,
328
+ uint64_t ret_be, int mmu_idx,
329
+ MMUAccessType type, uintptr_t ra)
330
{
331
- validate_memop(oi, MO_UB);
332
- return load_helper(env, addr, oi, retaddr, MO_UB, MMU_DATA_LOAD,
333
- full_ldub_mmu);
334
+ CPUTLBEntryFull *full = p->full;
335
+ target_ulong addr = p->addr;
336
+ int i, size = p->size;
337
+
338
+ QEMU_IOTHREAD_LOCK_GUARD();
339
+ for (i = 0; i < size; i++) {
340
+ uint8_t x = io_readx(env, full, mmu_idx, addr + i, ra, type, MO_UB);
341
+ ret_be = (ret_be << 8) | x;
342
+ }
343
+ return ret_be;
344
+}
345
+
346
+/**
347
+ * do_ld_bytes_beN
348
+ * @p: translation parameters
349
+ * @ret_be: accumulated data
350
+ *
351
+ * Load @p->size bytes from @p->haddr, which is RAM.
352
+ * The bytes are concatenated in big-endian order with @ret_be.
353
+ */
354
+static uint64_t do_ld_bytes_beN(MMULookupPageData *p, uint64_t ret_be)
355
+{
356
+ uint8_t *haddr = p->haddr;
357
+ int i, size = p->size;
358
+
359
+ for (i = 0; i < size; i++) {
360
+ ret_be = (ret_be << 8) | haddr[i];
361
+ }
362
+ return ret_be;
363
+}
364
+
365
+/*
366
+ * Wrapper for the above.
367
+ */
368
+static uint64_t do_ld_beN(CPUArchState *env, MMULookupPageData *p,
369
+ uint64_t ret_be, int mmu_idx,
370
+ MMUAccessType type, uintptr_t ra)
371
+{
372
+ if (unlikely(p->flags & TLB_MMIO)) {
373
+ return do_ld_mmio_beN(env, p, ret_be, mmu_idx, type, ra);
374
+ } else {
375
+ return do_ld_bytes_beN(p, ret_be);
376
+ }
377
+}
378
+
379
+static uint8_t do_ld_1(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
380
+ MMUAccessType type, uintptr_t ra)
381
+{
382
+ if (unlikely(p->flags & TLB_MMIO)) {
383
+ return io_readx(env, p->full, mmu_idx, p->addr, ra, type, MO_UB);
384
+ } else {
385
+ return *(uint8_t *)p->haddr;
386
+ }
387
+}
388
+
389
+static uint16_t do_ld_2(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
390
+ MMUAccessType type, MemOp memop, uintptr_t ra)
391
+{
392
+ uint64_t ret;
393
+
394
+ if (unlikely(p->flags & TLB_MMIO)) {
395
+ return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
396
+ }
397
+
398
+ /* Perform the load host endian, then swap if necessary. */
399
+ ret = load_memop(p->haddr, MO_UW);
400
+ if (memop & MO_BSWAP) {
401
+ ret = bswap16(ret);
402
+ }
403
+ return ret;
404
+}
405
+
406
+static uint32_t do_ld_4(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
407
+ MMUAccessType type, MemOp memop, uintptr_t ra)
408
+{
409
+ uint32_t ret;
410
+
411
+ if (unlikely(p->flags & TLB_MMIO)) {
412
+ return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
413
+ }
414
+
415
+ /* Perform the load host endian. */
416
+ ret = load_memop(p->haddr, MO_UL);
417
+ if (memop & MO_BSWAP) {
418
+ ret = bswap32(ret);
419
+ }
420
+ return ret;
421
+}
422
+
423
+static uint64_t do_ld_8(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
424
+ MMUAccessType type, MemOp memop, uintptr_t ra)
425
+{
426
+ uint64_t ret;
427
+
428
+ if (unlikely(p->flags & TLB_MMIO)) {
429
+ return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
430
+ }
431
+
432
+ /* Perform the load host endian. */
433
+ ret = load_memop(p->haddr, MO_UQ);
434
+ if (memop & MO_BSWAP) {
435
+ ret = bswap64(ret);
436
+ }
437
+ return ret;
438
+}
439
+
440
+static uint8_t do_ld1_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
441
+ uintptr_t ra, MMUAccessType access_type)
442
+{
443
+ MMULookupLocals l;
444
+ bool crosspage;
445
+
446
+ crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
447
+ tcg_debug_assert(!crosspage);
448
+
449
+ return do_ld_1(env, &l.page[0], l.mmu_idx, access_type, ra);
450
}
451
452
tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
453
MemOpIdx oi, uintptr_t retaddr)
454
{
455
- return full_ldub_mmu(env, addr, oi, retaddr);
456
+ validate_memop(oi, MO_UB);
457
+ return do_ld1_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
458
}
459
460
-static uint64_t full_le_lduw_mmu(CPUArchState *env, target_ulong addr,
461
- MemOpIdx oi, uintptr_t retaddr)
462
+static uint16_t do_ld2_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
463
+ uintptr_t ra, MMUAccessType access_type)
464
{
465
- validate_memop(oi, MO_LEUW);
466
- return load_helper(env, addr, oi, retaddr, MO_LEUW, MMU_DATA_LOAD,
467
- full_le_lduw_mmu);
468
+ MMULookupLocals l;
469
+ bool crosspage;
470
+ uint16_t ret;
471
+ uint8_t a, b;
472
+
473
+ crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
474
+ if (likely(!crosspage)) {
475
+ return do_ld_2(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
476
+ }
477
+
478
+ a = do_ld_1(env, &l.page[0], l.mmu_idx, access_type, ra);
479
+ b = do_ld_1(env, &l.page[1], l.mmu_idx, access_type, ra);
480
+
481
+ if ((l.memop & MO_BSWAP) == MO_LE) {
482
+ ret = a | (b << 8);
483
+ } else {
484
+ ret = b | (a << 8);
485
+ }
486
+ return ret;
487
}
488
489
tcg_target_ulong helper_le_lduw_mmu(CPUArchState *env, target_ulong addr,
490
MemOpIdx oi, uintptr_t retaddr)
491
{
492
- return full_le_lduw_mmu(env, addr, oi, retaddr);
493
-}
494
-
495
-static uint64_t full_be_lduw_mmu(CPUArchState *env, target_ulong addr,
496
- MemOpIdx oi, uintptr_t retaddr)
497
-{
498
- validate_memop(oi, MO_BEUW);
499
- return load_helper(env, addr, oi, retaddr, MO_BEUW, MMU_DATA_LOAD,
500
- full_be_lduw_mmu);
501
+ validate_memop(oi, MO_LEUW);
502
+ return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
503
}
504
505
tcg_target_ulong helper_be_lduw_mmu(CPUArchState *env, target_ulong addr,
506
MemOpIdx oi, uintptr_t retaddr)
507
{
508
- return full_be_lduw_mmu(env, addr, oi, retaddr);
509
+ validate_memop(oi, MO_BEUW);
510
+ return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
511
}
512
513
-static uint64_t full_le_ldul_mmu(CPUArchState *env, target_ulong addr,
514
- MemOpIdx oi, uintptr_t retaddr)
515
+static uint32_t do_ld4_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
516
+ uintptr_t ra, MMUAccessType access_type)
517
{
518
- validate_memop(oi, MO_LEUL);
519
- return load_helper(env, addr, oi, retaddr, MO_LEUL, MMU_DATA_LOAD,
520
- full_le_ldul_mmu);
521
+ MMULookupLocals l;
522
+ bool crosspage;
523
+ uint32_t ret;
524
+
525
+ crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
526
+ if (likely(!crosspage)) {
527
+ return do_ld_4(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
528
+ }
529
+
530
+ ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
531
+ ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
532
+ if ((l.memop & MO_BSWAP) == MO_LE) {
533
+ ret = bswap32(ret);
534
+ }
535
+ return ret;
536
}
537
538
tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
539
MemOpIdx oi, uintptr_t retaddr)
540
{
541
- return full_le_ldul_mmu(env, addr, oi, retaddr);
542
-}
543
-
544
-static uint64_t full_be_ldul_mmu(CPUArchState *env, target_ulong addr,
545
- MemOpIdx oi, uintptr_t retaddr)
546
-{
547
- validate_memop(oi, MO_BEUL);
548
- return load_helper(env, addr, oi, retaddr, MO_BEUL, MMU_DATA_LOAD,
549
- full_be_ldul_mmu);
550
+ validate_memop(oi, MO_LEUL);
551
+ return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
552
}
553
554
tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
555
MemOpIdx oi, uintptr_t retaddr)
556
{
557
- return full_be_ldul_mmu(env, addr, oi, retaddr);
558
+ validate_memop(oi, MO_BEUL);
559
+ return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
560
+}
561
+
562
+static uint64_t do_ld8_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
563
+ uintptr_t ra, MMUAccessType access_type)
564
+{
565
+ MMULookupLocals l;
566
+ bool crosspage;
567
+ uint64_t ret;
568
+
569
+ crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
570
+ if (likely(!crosspage)) {
571
+ return do_ld_8(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
572
+ }
573
+
574
+ ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
575
+ ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
576
+ if ((l.memop & MO_BSWAP) == MO_LE) {
577
+ ret = bswap64(ret);
578
+ }
579
+ return ret;
580
}
581
582
uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
583
MemOpIdx oi, uintptr_t retaddr)
584
{
585
validate_memop(oi, MO_LEUQ);
586
- return load_helper(env, addr, oi, retaddr, MO_LEUQ, MMU_DATA_LOAD,
587
- helper_le_ldq_mmu);
588
+ return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
589
}
590
591
uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
592
MemOpIdx oi, uintptr_t retaddr)
593
{
594
validate_memop(oi, MO_BEUQ);
595
- return load_helper(env, addr, oi, retaddr, MO_BEUQ, MMU_DATA_LOAD,
596
- helper_be_ldq_mmu);
597
+ return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
598
}
599
600
/*
601
@@ -XXX,XX +XXX,XX @@ tcg_target_ulong helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr,
602
* Load helpers for cpu_ldst.h.
603
*/
604
605
-static inline uint64_t cpu_load_helper(CPUArchState *env, abi_ptr addr,
606
- MemOpIdx oi, uintptr_t retaddr,
607
- FullLoadHelper *full_load)
608
+static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
609
{
610
- uint64_t ret;
611
-
612
- ret = full_load(env, addr, oi, retaddr);
613
qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
614
- return ret;
615
}
616
617
uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
618
{
619
- return cpu_load_helper(env, addr, oi, ra, full_ldub_mmu);
620
+ uint8_t ret;
621
+
622
+ validate_memop(oi, MO_UB);
623
+ ret = do_ld1_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
624
+ plugin_load_cb(env, addr, oi);
625
+ return ret;
626
}
627
628
uint16_t cpu_ldw_be_mmu(CPUArchState *env, abi_ptr addr,
629
MemOpIdx oi, uintptr_t ra)
630
{
631
- return cpu_load_helper(env, addr, oi, ra, full_be_lduw_mmu);
632
+ uint16_t ret;
633
+
634
+ validate_memop(oi, MO_BEUW);
635
+ ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
636
+ plugin_load_cb(env, addr, oi);
637
+ return ret;
638
}
639
640
uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
641
MemOpIdx oi, uintptr_t ra)
642
{
643
- return cpu_load_helper(env, addr, oi, ra, full_be_ldul_mmu);
644
+ uint32_t ret;
645
+
646
+ validate_memop(oi, MO_BEUL);
647
+ ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
648
+ plugin_load_cb(env, addr, oi);
649
+ return ret;
650
}
651
652
uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
653
MemOpIdx oi, uintptr_t ra)
654
{
655
- return cpu_load_helper(env, addr, oi, ra, helper_be_ldq_mmu);
656
+ uint64_t ret;
657
+
658
+ validate_memop(oi, MO_BEUQ);
659
+ ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
660
+ plugin_load_cb(env, addr, oi);
661
+ return ret;
662
}
663
664
uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
665
MemOpIdx oi, uintptr_t ra)
666
{
667
- return cpu_load_helper(env, addr, oi, ra, full_le_lduw_mmu);
668
+ uint16_t ret;
669
+
670
+ validate_memop(oi, MO_LEUW);
671
+ ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
672
+ plugin_load_cb(env, addr, oi);
673
+ return ret;
674
}
675
676
uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
677
MemOpIdx oi, uintptr_t ra)
678
{
679
- return cpu_load_helper(env, addr, oi, ra, full_le_ldul_mmu);
680
+ uint32_t ret;
681
+
682
+ validate_memop(oi, MO_LEUL);
683
+ ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
684
+ plugin_load_cb(env, addr, oi);
685
+ return ret;
686
}
687
688
uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
689
MemOpIdx oi, uintptr_t ra)
690
{
691
- return cpu_load_helper(env, addr, oi, ra, helper_le_ldq_mmu);
692
+ uint64_t ret;
693
+
694
+ validate_memop(oi, MO_LEUQ);
695
+ ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
696
+ plugin_load_cb(env, addr, oi);
697
+ return ret;
698
}
699
700
Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
701
@@ -XXX,XX +XXX,XX @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
702
703
/* Code access functions. */
704
705
-static uint64_t full_ldub_code(CPUArchState *env, target_ulong addr,
706
- MemOpIdx oi, uintptr_t retaddr)
707
-{
708
- return load_helper(env, addr, oi, retaddr, MO_8,
709
- MMU_INST_FETCH, full_ldub_code);
710
-}
711
-
712
uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
713
{
714
MemOpIdx oi = make_memop_idx(MO_UB, cpu_mmu_index(env, true));
715
- return full_ldub_code(env, addr, oi, 0);
716
-}
717
-
718
-static uint64_t full_lduw_code(CPUArchState *env, target_ulong addr,
719
- MemOpIdx oi, uintptr_t retaddr)
720
-{
721
- return load_helper(env, addr, oi, retaddr, MO_TEUW,
722
- MMU_INST_FETCH, full_lduw_code);
723
+ return do_ld1_mmu(env, addr, oi, 0, MMU_INST_FETCH);
724
}
725
726
uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
727
{
728
MemOpIdx oi = make_memop_idx(MO_TEUW, cpu_mmu_index(env, true));
729
- return full_lduw_code(env, addr, oi, 0);
730
-}
731
-
732
-static uint64_t full_ldl_code(CPUArchState *env, target_ulong addr,
733
- MemOpIdx oi, uintptr_t retaddr)
734
-{
735
- return load_helper(env, addr, oi, retaddr, MO_TEUL,
736
- MMU_INST_FETCH, full_ldl_code);
737
+ return do_ld2_mmu(env, addr, oi, 0, MMU_INST_FETCH);
738
}
739
740
uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
741
{
742
MemOpIdx oi = make_memop_idx(MO_TEUL, cpu_mmu_index(env, true));
743
- return full_ldl_code(env, addr, oi, 0);
744
-}
745
-
746
-static uint64_t full_ldq_code(CPUArchState *env, target_ulong addr,
747
- MemOpIdx oi, uintptr_t retaddr)
748
-{
749
- return load_helper(env, addr, oi, retaddr, MO_TEUQ,
750
- MMU_INST_FETCH, full_ldq_code);
751
+ return do_ld4_mmu(env, addr, oi, 0, MMU_INST_FETCH);
752
}
753
754
uint64_t cpu_ldq_code(CPUArchState *env, abi_ptr addr)
755
{
756
MemOpIdx oi = make_memop_idx(MO_TEUQ, cpu_mmu_index(env, true));
757
- return full_ldq_code(env, addr, oi, 0);
758
+ return do_ld8_mmu(env, addr, oi, 0, MMU_INST_FETCH);
759
}
760
761
uint8_t cpu_ldb_code_mmu(CPUArchState *env, abi_ptr addr,
762
MemOpIdx oi, uintptr_t retaddr)
763
{
764
- return full_ldub_code(env, addr, oi, retaddr);
765
+ return do_ld1_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
766
}
767
768
uint16_t cpu_ldw_code_mmu(CPUArchState *env, abi_ptr addr,
769
MemOpIdx oi, uintptr_t retaddr)
770
{
771
- MemOp mop = get_memop(oi);
772
- int idx = get_mmuidx(oi);
773
- uint16_t ret;
774
-
775
- ret = full_lduw_code(env, addr, make_memop_idx(MO_TEUW, idx), retaddr);
776
- if ((mop & MO_BSWAP) != MO_TE) {
777
- ret = bswap16(ret);
778
- }
779
- return ret;
780
+ return do_ld2_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
781
}
782
783
uint32_t cpu_ldl_code_mmu(CPUArchState *env, abi_ptr addr,
784
MemOpIdx oi, uintptr_t retaddr)
785
{
786
- MemOp mop = get_memop(oi);
787
- int idx = get_mmuidx(oi);
788
- uint32_t ret;
789
-
790
- ret = full_ldl_code(env, addr, make_memop_idx(MO_TEUL, idx), retaddr);
791
- if ((mop & MO_BSWAP) != MO_TE) {
792
- ret = bswap32(ret);
793
- }
794
- return ret;
795
+ return do_ld4_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
796
}
797
798
uint64_t cpu_ldq_code_mmu(CPUArchState *env, abi_ptr addr,
799
MemOpIdx oi, uintptr_t retaddr)
800
{
801
- MemOp mop = get_memop(oi);
802
- int idx = get_mmuidx(oi);
803
- uint64_t ret;
804
-
805
- ret = full_ldq_code(env, addr, make_memop_idx(MO_TEUQ, idx), retaddr);
806
- if ((mop & MO_BSWAP) != MO_TE) {
807
- ret = bswap64(ret);
808
- }
809
- return ret;
810
+ return do_ld8_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
811
}
812
--
114
--
813
2.34.1
115
2.34.1
814
815
1
Instead of playing with offsetof in various places, use
2
MMUAccessType to index an array. This is easily defined
3
instead of the previous dummy padding array in the union.
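
As a rough, self-contained sketch of the layout trick (types are simplified to uint64_t and the array size hard-coded; TLBEntrySketch and tlb_read_idx_sketch are names invented for this sketch, while the real CPUTLBEntry, tlb_read_idx and qatomic_read are in the hunks below):

    #include <assert.h>
    #include <stdint.h>

    typedef enum { MMU_DATA_LOAD, MMU_DATA_STORE, MMU_INST_FETCH } MMUAccessTypeSketch;

    typedef struct {
        union {
            struct {
                uint64_t addr_read;
                uint64_t addr_write;
                uint64_t addr_code;
                uint64_t addend;
            };
            /* The same storage, viewed as an array indexed by access type. */
            uint64_t addr_idx[4];
        };
    } TLBEntrySketch;

    static uint64_t tlb_read_idx_sketch(const TLBEntrySketch *e,
                                        MMUAccessTypeSketch type)
    {
        /* Only valid because the comparators are laid out in enum order. */
        assert(type <= MMU_INST_FETCH);
        return e->addr_idx[type];
    }

Callers can then write tlb_read_idx(entry, access_type) instead of computing offsetof(CPUTLBEntry, addr_write) and friends by hand, which is what the cputlb.c hunk below does for victim_tlb_hit, probe_access_internal and load_helper.
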
4
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
7
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
8
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
9
---
3
---
10
include/exec/cpu-defs.h | 7 ++-
4
host/include/generic/host/store-insert-al16.h | 50 +++++++++++++++++++
11
include/exec/cpu_ldst.h | 26 ++++++++--
5
accel/tcg/ldst_atomicity.c.inc | 40 +--------------
12
accel/tcg/cputlb.c | 104 +++++++++++++---------------------------
6
2 files changed, 51 insertions(+), 39 deletions(-)
13
3 files changed, 59 insertions(+), 78 deletions(-)
7
create mode 100644 host/include/generic/host/store-insert-al16.h
14
8
15
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
9
diff --git a/host/include/generic/host/store-insert-al16.h b/host/include/generic/host/store-insert-al16.h
16
index XXXXXXX..XXXXXXX 100644
10
new file mode 100644
17
--- a/include/exec/cpu-defs.h
11
index XXXXXXX..XXXXXXX
18
+++ b/include/exec/cpu-defs.h
12
--- /dev/null
19
@@ -XXX,XX +XXX,XX @@ typedef struct CPUTLBEntry {
13
+++ b/host/include/generic/host/store-insert-al16.h
20
use the corresponding iotlb value. */
14
@@ -XXX,XX +XXX,XX @@
21
uintptr_t addend;
15
+/*
22
};
16
+ * SPDX-License-Identifier: GPL-2.0-or-later
23
- /* padding to get a power of two size */
17
+ * Atomic store insert into 128-bit, generic version.
24
- uint8_t dummy[1 << CPU_TLB_ENTRY_BITS];
18
+ *
25
+ /*
19
+ * Copyright (C) 2023 Linaro, Ltd.
26
+ * Padding to get a power of two size, as well as index
20
+ */
27
+ * access to addr_{read,write,code}.
21
+
28
+ */
22
+#ifndef HOST_STORE_INSERT_AL16_H
29
+ target_ulong addr_idx[(1 << CPU_TLB_ENTRY_BITS) / TARGET_LONG_SIZE];
23
+#define HOST_STORE_INSERT_AL16_H
30
};
24
+
31
} CPUTLBEntry;
25
+/**
32
26
+ * store_atom_insert_al16:
33
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
27
+ * @p: host address
34
index XXXXXXX..XXXXXXX 100644
28
+ * @val: shifted value to store
35
--- a/include/exec/cpu_ldst.h
29
+ * @msk: mask for value to store
36
+++ b/include/exec/cpu_ldst.h
30
+ *
37
@@ -XXX,XX +XXX,XX @@ static inline void clear_helper_retaddr(void)
31
+ * Atomically store @val to @p masked by @msk.
38
/* Needed for TCG_OVERSIZED_GUEST */
32
+ */
39
#include "tcg/tcg.h"
33
+static inline void ATTRIBUTE_ATOMIC128_OPT
40
34
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
41
+static inline target_ulong tlb_read_idx(const CPUTLBEntry *entry,
42
+ MMUAccessType access_type)
43
+{
35
+{
44
+ /* Do not rearrange the CPUTLBEntry structure members. */
36
+#if defined(CONFIG_ATOMIC128)
45
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_read) !=
37
+ __uint128_t *pu;
46
+ MMU_DATA_LOAD * TARGET_LONG_SIZE);
38
+ Int128Alias old, new;
47
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_write) !=
48
+ MMU_DATA_STORE * TARGET_LONG_SIZE);
49
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_code) !=
50
+ MMU_INST_FETCH * TARGET_LONG_SIZE);
51
+
39
+
52
+ const target_ulong *ptr = &entry->addr_idx[access_type];
40
+ /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
53
+#if TCG_OVERSIZED_GUEST
41
+ pu = __builtin_assume_aligned(ps, 16);
54
+ return *ptr;
42
+ old.u = *pu;
43
+ msk = int128_not(msk);
44
+ do {
45
+ new.s = int128_and(old.s, msk);
46
+ new.s = int128_or(new.s, val);
47
+ } while (!__atomic_compare_exchange_n(pu, &old.u, new.u, true,
48
+ __ATOMIC_RELAXED, __ATOMIC_RELAXED));
55
+#else
49
+#else
56
+ /* ofs might correspond to .addr_write, so use qatomic_read */
50
+ Int128 old, new, cmp;
57
+ return qatomic_read(ptr);
51
+
52
+ ps = __builtin_assume_aligned(ps, 16);
53
+ old = *ps;
54
+ msk = int128_not(msk);
55
+ do {
56
+ cmp = old;
57
+ new = int128_and(old, msk);
58
+ new = int128_or(new, val);
59
+ old = atomic16_cmpxchg(ps, cmp, new);
60
+ } while (int128_ne(cmp, old));
58
+#endif
61
+#endif
59
+}
62
+}
60
+
63
+
61
static inline target_ulong tlb_addr_write(const CPUTLBEntry *entry)
64
+#endif /* HOST_STORE_INSERT_AL16_H */
62
{
65
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
63
-#if TCG_OVERSIZED_GUEST
66
index XXXXXXX..XXXXXXX 100644
64
- return entry->addr_write;
67
--- a/accel/tcg/ldst_atomicity.c.inc
68
+++ b/accel/tcg/ldst_atomicity.c.inc
69
@@ -XXX,XX +XXX,XX @@
70
*/
71
72
#include "host/load-extract-al16-al8.h"
73
+#include "host/store-insert-al16.h"
74
75
#ifdef CONFIG_ATOMIC64
76
# define HAVE_al8 true
77
@@ -XXX,XX +XXX,XX @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
78
__ATOMIC_RELAXED, __ATOMIC_RELAXED));
79
}
80
81
-/**
82
- * store_atom_insert_al16:
83
- * @p: host address
84
- * @val: shifted value to store
85
- * @msk: mask for value to store
86
- *
87
- * Atomically store @val to @p masked by @msk.
88
- */
89
-static void ATTRIBUTE_ATOMIC128_OPT
90
-store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
91
-{
92
-#if defined(CONFIG_ATOMIC128)
93
- __uint128_t *pu, old, new;
94
-
95
- /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
96
- pu = __builtin_assume_aligned(ps, 16);
97
- old = *pu;
98
- do {
99
- new = (old & ~msk.u) | val.u;
100
- } while (!__atomic_compare_exchange_n(pu, &old, new, true,
101
- __ATOMIC_RELAXED, __ATOMIC_RELAXED));
102
-#elif defined(CONFIG_CMPXCHG128)
103
- __uint128_t *pu, old, new;
104
-
105
- /*
106
- * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
107
- * defer to libatomic, so we must use __sync_*_compare_and_swap_16
108
- * and accept the sequential consistency that comes with it.
109
- */
110
- pu = __builtin_assume_aligned(ps, 16);
111
- do {
112
- old = *pu;
113
- new = (old & ~msk.u) | val.u;
114
- } while (!__sync_bool_compare_and_swap_16(pu, old, new));
65
-#else
115
-#else
66
- return qatomic_read(&entry->addr_write);
116
- qemu_build_not_reached();
67
-#endif
68
+ return tlb_read_idx(entry, MMU_DATA_STORE);
69
}
70
71
/* Find the TLB index corresponding to the mmu_idx + address pair. */
72
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
73
index XXXXXXX..XXXXXXX 100644
74
--- a/accel/tcg/cputlb.c
75
+++ b/accel/tcg/cputlb.c
76
@@ -XXX,XX +XXX,XX @@ static void io_writex(CPUArchState *env, CPUTLBEntryFull *full,
77
}
78
}
79
80
-static inline target_ulong tlb_read_ofs(CPUTLBEntry *entry, size_t ofs)
81
-{
82
-#if TCG_OVERSIZED_GUEST
83
- return *(target_ulong *)((uintptr_t)entry + ofs);
84
-#else
85
- /* ofs might correspond to .addr_write, so use qatomic_read */
86
- return qatomic_read((target_ulong *)((uintptr_t)entry + ofs));
87
-#endif
117
-#endif
88
-}
118
-}
89
-
119
-
90
/* Return true if ADDR is present in the victim tlb, and has been copied
120
/**
91
back to the main tlb. */
121
* store_bytes_leN:
92
static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
122
* @pv: host address
93
- size_t elt_ofs, target_ulong page)
94
+ MMUAccessType access_type, target_ulong page)
95
{
96
size_t vidx;
97
98
assert_cpu_is_self(env_cpu(env));
99
for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
100
CPUTLBEntry *vtlb = &env_tlb(env)->d[mmu_idx].vtable[vidx];
101
- target_ulong cmp;
102
-
103
- /* elt_ofs might correspond to .addr_write, so use qatomic_read */
104
-#if TCG_OVERSIZED_GUEST
105
- cmp = *(target_ulong *)((uintptr_t)vtlb + elt_ofs);
106
-#else
107
- cmp = qatomic_read((target_ulong *)((uintptr_t)vtlb + elt_ofs));
108
-#endif
109
+ target_ulong cmp = tlb_read_idx(vtlb, access_type);
110
111
if (cmp == page) {
112
/* Found entry in victim tlb, swap tlb and iotlb. */
113
@@ -XXX,XX +XXX,XX @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
114
return false;
115
}
116
117
-/* Macro to call the above, with local variables from the use context. */
118
-#define VICTIM_TLB_HIT(TY, ADDR) \
119
- victim_tlb_hit(env, mmu_idx, index, offsetof(CPUTLBEntry, TY), \
120
- (ADDR) & TARGET_PAGE_MASK)
121
-
122
static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
123
CPUTLBEntryFull *full, uintptr_t retaddr)
124
{
125
@@ -XXX,XX +XXX,XX @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
126
{
127
uintptr_t index = tlb_index(env, mmu_idx, addr);
128
CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
129
- target_ulong tlb_addr, page_addr;
130
- size_t elt_ofs;
131
- int flags;
132
+ target_ulong tlb_addr = tlb_read_idx(entry, access_type);
133
+ target_ulong page_addr = addr & TARGET_PAGE_MASK;
134
+ int flags = TLB_FLAGS_MASK;
135
136
- switch (access_type) {
137
- case MMU_DATA_LOAD:
138
- elt_ofs = offsetof(CPUTLBEntry, addr_read);
139
- break;
140
- case MMU_DATA_STORE:
141
- elt_ofs = offsetof(CPUTLBEntry, addr_write);
142
- break;
143
- case MMU_INST_FETCH:
144
- elt_ofs = offsetof(CPUTLBEntry, addr_code);
145
- break;
146
- default:
147
- g_assert_not_reached();
148
- }
149
- tlb_addr = tlb_read_ofs(entry, elt_ofs);
150
-
151
- flags = TLB_FLAGS_MASK;
152
- page_addr = addr & TARGET_PAGE_MASK;
153
if (!tlb_hit_page(tlb_addr, page_addr)) {
154
- if (!victim_tlb_hit(env, mmu_idx, index, elt_ofs, page_addr)) {
155
+ if (!victim_tlb_hit(env, mmu_idx, index, access_type, page_addr)) {
156
CPUState *cs = env_cpu(env);
157
158
if (!cs->cc->tcg_ops->tlb_fill(cs, addr, fault_size, access_type,
159
@@ -XXX,XX +XXX,XX @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
160
*/
161
flags &= ~TLB_INVALID_MASK;
162
}
163
- tlb_addr = tlb_read_ofs(entry, elt_ofs);
164
+ tlb_addr = tlb_read_idx(entry, access_type);
165
}
166
flags &= tlb_addr;
167
168
@@ -XXX,XX +XXX,XX @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
169
if (prot & PAGE_WRITE) {
170
tlb_addr = tlb_addr_write(tlbe);
171
if (!tlb_hit(tlb_addr, addr)) {
172
- if (!VICTIM_TLB_HIT(addr_write, addr)) {
173
+ if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
174
+ addr & TARGET_PAGE_MASK)) {
175
tlb_fill(env_cpu(env), addr, size,
176
MMU_DATA_STORE, mmu_idx, retaddr);
177
index = tlb_index(env, mmu_idx, addr);
178
@@ -XXX,XX +XXX,XX @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
179
} else /* if (prot & PAGE_READ) */ {
180
tlb_addr = tlbe->addr_read;
181
if (!tlb_hit(tlb_addr, addr)) {
182
- if (!VICTIM_TLB_HIT(addr_read, addr)) {
183
+ if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
184
+ addr & TARGET_PAGE_MASK)) {
185
tlb_fill(env_cpu(env), addr, size,
186
MMU_DATA_LOAD, mmu_idx, retaddr);
187
index = tlb_index(env, mmu_idx, addr);
188
@@ -XXX,XX +XXX,XX @@ load_memop(const void *haddr, MemOp op)
189
190
static inline uint64_t QEMU_ALWAYS_INLINE
191
load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
192
- uintptr_t retaddr, MemOp op, bool code_read,
193
+ uintptr_t retaddr, MemOp op, MMUAccessType access_type,
194
FullLoadHelper *full_load)
195
{
196
- const size_t tlb_off = code_read ?
197
- offsetof(CPUTLBEntry, addr_code) : offsetof(CPUTLBEntry, addr_read);
198
- const MMUAccessType access_type =
199
- code_read ? MMU_INST_FETCH : MMU_DATA_LOAD;
200
const unsigned a_bits = get_alignment_bits(get_memop(oi));
201
const size_t size = memop_size(op);
202
uintptr_t mmu_idx = get_mmuidx(oi);
203
@@ -XXX,XX +XXX,XX @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
204
205
index = tlb_index(env, mmu_idx, addr);
206
entry = tlb_entry(env, mmu_idx, addr);
207
- tlb_addr = code_read ? entry->addr_code : entry->addr_read;
208
+ tlb_addr = tlb_read_idx(entry, access_type);
209
210
/* If the TLB entry is for a different page, reload and try again. */
211
if (!tlb_hit(tlb_addr, addr)) {
212
- if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
213
+ if (!victim_tlb_hit(env, mmu_idx, index, access_type,
214
addr & TARGET_PAGE_MASK)) {
215
tlb_fill(env_cpu(env), addr, size,
216
access_type, mmu_idx, retaddr);
217
index = tlb_index(env, mmu_idx, addr);
218
entry = tlb_entry(env, mmu_idx, addr);
219
}
220
- tlb_addr = code_read ? entry->addr_code : entry->addr_read;
221
+ tlb_addr = tlb_read_idx(entry, access_type);
222
tlb_addr &= ~TLB_INVALID_MASK;
223
}
224
225
@@ -XXX,XX +XXX,XX @@ static uint64_t full_ldub_mmu(CPUArchState *env, target_ulong addr,
226
MemOpIdx oi, uintptr_t retaddr)
227
{
228
validate_memop(oi, MO_UB);
229
- return load_helper(env, addr, oi, retaddr, MO_UB, false, full_ldub_mmu);
230
+ return load_helper(env, addr, oi, retaddr, MO_UB, MMU_DATA_LOAD,
231
+ full_ldub_mmu);
232
}
233
234
tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
235
@@ -XXX,XX +XXX,XX @@ static uint64_t full_le_lduw_mmu(CPUArchState *env, target_ulong addr,
236
MemOpIdx oi, uintptr_t retaddr)
237
{
238
validate_memop(oi, MO_LEUW);
239
- return load_helper(env, addr, oi, retaddr, MO_LEUW, false,
240
+ return load_helper(env, addr, oi, retaddr, MO_LEUW, MMU_DATA_LOAD,
241
full_le_lduw_mmu);
242
}
243
244
@@ -XXX,XX +XXX,XX @@ static uint64_t full_be_lduw_mmu(CPUArchState *env, target_ulong addr,
245
MemOpIdx oi, uintptr_t retaddr)
246
{
247
validate_memop(oi, MO_BEUW);
248
- return load_helper(env, addr, oi, retaddr, MO_BEUW, false,
249
+ return load_helper(env, addr, oi, retaddr, MO_BEUW, MMU_DATA_LOAD,
250
full_be_lduw_mmu);
251
}
252
253
@@ -XXX,XX +XXX,XX @@ static uint64_t full_le_ldul_mmu(CPUArchState *env, target_ulong addr,
254
MemOpIdx oi, uintptr_t retaddr)
255
{
256
validate_memop(oi, MO_LEUL);
257
- return load_helper(env, addr, oi, retaddr, MO_LEUL, false,
258
+ return load_helper(env, addr, oi, retaddr, MO_LEUL, MMU_DATA_LOAD,
259
full_le_ldul_mmu);
260
}
261
262
@@ -XXX,XX +XXX,XX @@ static uint64_t full_be_ldul_mmu(CPUArchState *env, target_ulong addr,
263
MemOpIdx oi, uintptr_t retaddr)
264
{
265
validate_memop(oi, MO_BEUL);
266
- return load_helper(env, addr, oi, retaddr, MO_BEUL, false,
267
+ return load_helper(env, addr, oi, retaddr, MO_BEUL, MMU_DATA_LOAD,
268
full_be_ldul_mmu);
269
}
270
271
@@ -XXX,XX +XXX,XX @@ uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
272
MemOpIdx oi, uintptr_t retaddr)
273
{
274
validate_memop(oi, MO_LEUQ);
275
- return load_helper(env, addr, oi, retaddr, MO_LEUQ, false,
276
+ return load_helper(env, addr, oi, retaddr, MO_LEUQ, MMU_DATA_LOAD,
277
helper_le_ldq_mmu);
278
}
279
280
@@ -XXX,XX +XXX,XX @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
281
MemOpIdx oi, uintptr_t retaddr)
282
{
283
validate_memop(oi, MO_BEUQ);
284
- return load_helper(env, addr, oi, retaddr, MO_BEUQ, false,
285
+ return load_helper(env, addr, oi, retaddr, MO_BEUQ, MMU_DATA_LOAD,
286
helper_be_ldq_mmu);
287
}
288
289
@@ -XXX,XX +XXX,XX @@ store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
290
uintptr_t retaddr, size_t size, uintptr_t mmu_idx,
291
bool big_endian)
292
{
293
- const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
294
uintptr_t index, index2;
295
CPUTLBEntry *entry, *entry2;
296
target_ulong page1, page2, tlb_addr, tlb_addr2;
297
@@ -XXX,XX +XXX,XX @@ store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
298
299
tlb_addr2 = tlb_addr_write(entry2);
300
if (page1 != page2 && !tlb_hit_page(tlb_addr2, page2)) {
301
- if (!victim_tlb_hit(env, mmu_idx, index2, tlb_off, page2)) {
302
+ if (!victim_tlb_hit(env, mmu_idx, index2, MMU_DATA_STORE, page2)) {
303
tlb_fill(env_cpu(env), page2, size2, MMU_DATA_STORE,
304
mmu_idx, retaddr);
305
index2 = tlb_index(env, mmu_idx, page2);
306
@@ -XXX,XX +XXX,XX @@ static inline void QEMU_ALWAYS_INLINE
307
store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
308
MemOpIdx oi, uintptr_t retaddr, MemOp op)
309
{
310
- const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
311
const unsigned a_bits = get_alignment_bits(get_memop(oi));
312
const size_t size = memop_size(op);
313
uintptr_t mmu_idx = get_mmuidx(oi);
314
@@ -XXX,XX +XXX,XX @@ store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
315
316
/* If the TLB entry is for a different page, reload and try again. */
317
if (!tlb_hit(tlb_addr, addr)) {
318
- if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
319
+ if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
320
addr & TARGET_PAGE_MASK)) {
321
tlb_fill(env_cpu(env), addr, size, MMU_DATA_STORE,
322
mmu_idx, retaddr);
323
@@ -XXX,XX +XXX,XX @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
324
static uint64_t full_ldub_code(CPUArchState *env, target_ulong addr,
325
MemOpIdx oi, uintptr_t retaddr)
326
{
327
- return load_helper(env, addr, oi, retaddr, MO_8, true, full_ldub_code);
328
+ return load_helper(env, addr, oi, retaddr, MO_8,
329
+ MMU_INST_FETCH, full_ldub_code);
330
}
331
332
uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
333
@@ -XXX,XX +XXX,XX @@ uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
334
static uint64_t full_lduw_code(CPUArchState *env, target_ulong addr,
335
MemOpIdx oi, uintptr_t retaddr)
336
{
337
- return load_helper(env, addr, oi, retaddr, MO_TEUW, true, full_lduw_code);
338
+ return load_helper(env, addr, oi, retaddr, MO_TEUW,
339
+ MMU_INST_FETCH, full_lduw_code);
340
}
341
342
uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
343
@@ -XXX,XX +XXX,XX @@ uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
344
static uint64_t full_ldl_code(CPUArchState *env, target_ulong addr,
345
MemOpIdx oi, uintptr_t retaddr)
346
{
347
- return load_helper(env, addr, oi, retaddr, MO_TEUL, true, full_ldl_code);
348
+ return load_helper(env, addr, oi, retaddr, MO_TEUL,
349
+ MMU_INST_FETCH, full_ldl_code);
350
}
351
352
uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
353
@@ -XXX,XX +XXX,XX @@ uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
354
static uint64_t full_ldq_code(CPUArchState *env, target_ulong addr,
355
MemOpIdx oi, uintptr_t retaddr)
356
{
357
- return load_helper(env, addr, oi, retaddr, MO_TEUQ, true, full_ldq_code);
358
+ return load_helper(env, addr, oi, retaddr, MO_TEUQ,
359
+ MMU_INST_FETCH, full_ldq_code);
360
}
361
362
uint64_t cpu_ldq_code(CPUArchState *env, abi_ptr addr)
363
--
123
--
364
2.34.1
124
2.34.1
365
366
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, and some code that lived
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
in both tcg_out_qemu_ld and tcg_out_qemu_st into one function that
3
returns HostAddress and TCGLabelQemuLdst structures.
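
In outline, the emitters now follow the shape sketched below (the stub types, the _sketch suffixes and the printf stand-in are invented for illustration only; the real HostAddress, TCGLabelQemuLdst and the Arm prepare_host_addr are in the hunk that follows):

    #include <stdbool.h>
    #include <stdio.h>

    /* Cut-down stand-ins for the TCG-internal structures. */
    typedef struct { int base, index; bool index_scratch; } HostAddressSketch;
    typedef struct { bool is_ld; int oi, type, datalo_reg, datahi_reg; } LdstLabelSketch;

    static LdstLabelSketch label_pool[1];   /* the real code allocates one per site */

    /*
     * Combined TLB load / alignment test: fills *h for the fast path and
     * returns a slow-path label when one is needed, otherwise NULL.
     */
    static LdstLabelSketch *prepare_host_addr_sketch(HostAddressSketch *h,
                                                     int addrlo, int oi, bool is_ld)
    {
        h->base = addrlo;
        h->index = -1;
        h->index_scratch = false;
        label_pool[0].is_ld = is_ld;
        label_pool[0].oi = oi;
        return &label_pool[0];              /* pretend we are in the softmmu case */
    }

    static void emit_fast_path_sketch(const HostAddressSketch *h, int datalo)
    {
        printf("emit load via base r%d into r%d\n", h->base, datalo);
    }

    int main(void)
    {
        HostAddressSketch h;
        LdstLabelSketch *ldst = prepare_host_addr_sketch(&h, 0, 0, true);

        if (ldst) {
            /* A slow path exists: record the data registers on the label ... */
            ldst->type = 0;
            ldst->datalo_reg = 4;
            emit_fast_path_sketch(&h, ldst->datalo_reg);
            /* ... and the real code then records the fast-path return address. */
        } else {
            emit_fast_path_sketch(&h, 4);   /* aligned user-mode: no slow path */
        }
        return 0;
    }

The same if (ldst) pattern replaces the old #ifdef CONFIG_SOFTMMU split in both tcg_out_qemu_ld and tcg_out_qemu_st in the diff below.
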
4
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
3
---
8
tcg/arm/tcg-target.c.inc | 351 ++++++++++++++++++---------------------
4
.../x86_64/host/load-extract-al16-al8.h | 50 +++++++++++++++++++
9
1 file changed, 159 insertions(+), 192 deletions(-)
5
1 file changed, 50 insertions(+)
6
create mode 100644 host/include/x86_64/host/load-extract-al16-al8.h
10
7
11
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
8
diff --git a/host/include/x86_64/host/load-extract-al16-al8.h b/host/include/x86_64/host/load-extract-al16-al8.h
12
index XXXXXXX..XXXXXXX 100644
9
new file mode 100644
13
--- a/tcg/arm/tcg-target.c.inc
10
index XXXXXXX..XXXXXXX
14
+++ b/tcg/arm/tcg-target.c.inc
11
--- /dev/null
15
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_out_arg_reg64(TCGContext *s, TCGReg argreg,
12
+++ b/host/include/x86_64/host/load-extract-al16-al8.h
16
}
13
@@ -XXX,XX +XXX,XX @@
17
}
14
+/*
18
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
19
-#define TLB_SHIFT    (CPU_TLB_ENTRY_BITS + CPU_TLB_BITS)
16
+ * Atomic extract 64 from 128-bit, x86_64 version.
20
-
17
+ *
21
-/* We expect to use an 9-bit sign-magnitude negative offset from ENV. */
18
+ * Copyright (C) 2023 Linaro, Ltd.
22
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
19
+ */
23
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -256);
20
+
24
-
21
+#ifndef X86_64_LOAD_EXTRACT_AL16_AL8_H
25
-/* These offsets are built into the LDRD below. */
22
+#define X86_64_LOAD_EXTRACT_AL16_AL8_H
26
-QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
23
+
27
-QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 4);
24
+#ifdef CONFIG_INT128_TYPE
28
-
25
+#include "host/cpuinfo.h"
29
-/* Load and compare a TLB entry, leaving the flags set. Returns the register
26
+
30
- containing the addend of the tlb entry. Clobbers R0, R1, R2, TMP. */
27
+/**
31
-
28
+ * load_atom_extract_al16_or_al8:
32
-static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
29
+ * @pv: host address
33
- MemOp opc, int mem_index, bool is_load)
30
+ * @s: object size in bytes, @s <= 8.
34
-{
31
+ *
35
- int cmp_off = (is_load ? offsetof(CPUTLBEntry, addr_read)
32
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
36
- : offsetof(CPUTLBEntry, addr_write));
33
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
37
- int fast_off = TLB_MASK_TABLE_OFS(mem_index);
34
+ * otherwise the access must be 8-byte atomic.
38
- unsigned s_mask = (1 << (opc & MO_SIZE)) - 1;
35
+ */
39
- unsigned a_mask = (1 << get_alignment_bits(opc)) - 1;
36
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
40
- TCGReg t_addr;
37
+load_atom_extract_al16_or_al8(void *pv, int s)
41
-
42
- /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {r0,r1}. */
43
- tcg_out_ldrd_8(s, COND_AL, TCG_REG_R0, TCG_AREG0, fast_off);
44
-
45
- /* Extract the tlb index from the address into R0. */
46
- tcg_out_dat_reg(s, COND_AL, ARITH_AND, TCG_REG_R0, TCG_REG_R0, addrlo,
47
- SHIFT_IMM_LSR(TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS));
48
-
49
- /*
50
- * Add the tlb_table pointer, creating the CPUTLBEntry address in R1.
51
- * Load the tlb comparator into R2/R3 and the fast path addend into R1.
52
- */
53
- if (cmp_off == 0) {
54
- if (TARGET_LONG_BITS == 64) {
55
- tcg_out_ldrd_rwb(s, COND_AL, TCG_REG_R2, TCG_REG_R1, TCG_REG_R0);
56
- } else {
57
- tcg_out_ld32_rwb(s, COND_AL, TCG_REG_R2, TCG_REG_R1, TCG_REG_R0);
58
- }
59
- } else {
60
- tcg_out_dat_reg(s, COND_AL, ARITH_ADD,
61
- TCG_REG_R1, TCG_REG_R1, TCG_REG_R0, 0);
62
- if (TARGET_LONG_BITS == 64) {
63
- tcg_out_ldrd_8(s, COND_AL, TCG_REG_R2, TCG_REG_R1, cmp_off);
64
- } else {
65
- tcg_out_ld32_12(s, COND_AL, TCG_REG_R2, TCG_REG_R1, cmp_off);
66
- }
67
- }
68
-
69
- /* Load the tlb addend. */
70
- tcg_out_ld32_12(s, COND_AL, TCG_REG_R1, TCG_REG_R1,
71
- offsetof(CPUTLBEntry, addend));
72
-
73
- /*
74
- * Check alignment, check comparators.
75
- * Do this in 2-4 insns. Use MOVW for v7, if possible,
76
- * to reduce the number of sequential conditional instructions.
77
- * Almost all guests have at least 4k pages, which means that we need
78
- * to clear at least 9 bits even for an 8-byte memory, which means it
79
- * isn't worth checking for an immediate operand for BIC.
80
- *
81
- * For unaligned accesses, test the page of the last unit of alignment.
82
- * This leaves the least significant alignment bits unchanged, and of
83
- * course must be zero.
84
- */
85
- t_addr = addrlo;
86
- if (a_mask < s_mask) {
87
- t_addr = TCG_REG_R0;
88
- tcg_out_dat_imm(s, COND_AL, ARITH_ADD, t_addr,
89
- addrlo, s_mask - a_mask);
90
- }
91
- if (use_armv7_instructions && TARGET_PAGE_BITS <= 16) {
92
- tcg_out_movi32(s, COND_AL, TCG_REG_TMP, ~(TARGET_PAGE_MASK | a_mask));
93
- tcg_out_dat_reg(s, COND_AL, ARITH_BIC, TCG_REG_TMP,
94
- t_addr, TCG_REG_TMP, 0);
95
- tcg_out_dat_reg(s, COND_AL, ARITH_CMP, 0, TCG_REG_R2, TCG_REG_TMP, 0);
96
- } else {
97
- if (a_mask) {
98
- tcg_debug_assert(a_mask <= 0xff);
99
- tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
100
- }
101
- tcg_out_dat_reg(s, COND_AL, ARITH_MOV, TCG_REG_TMP, 0, t_addr,
102
- SHIFT_IMM_LSR(TARGET_PAGE_BITS));
103
- tcg_out_dat_reg(s, (a_mask ? COND_EQ : COND_AL), ARITH_CMP,
104
- 0, TCG_REG_R2, TCG_REG_TMP,
105
- SHIFT_IMM_LSL(TARGET_PAGE_BITS));
106
- }
107
-
108
- if (TARGET_LONG_BITS == 64) {
109
- tcg_out_dat_reg(s, COND_EQ, ARITH_CMP, 0, TCG_REG_R3, addrhi, 0);
110
- }
111
-
112
- return TCG_REG_R1;
113
-}
114
-
115
-/* Record the context of a call to the out of line helper code for the slow
116
- path for a load or store, so that we can later generate the correct
117
- helper code. */
118
-static void add_qemu_ldst_label(TCGContext *s, bool is_ld,
119
- MemOpIdx oi, TCGType type,
120
- TCGReg datalo, TCGReg datahi,
121
- TCGReg addrlo, TCGReg addrhi,
122
- tcg_insn_unit *raddr,
123
- tcg_insn_unit *label_ptr)
124
-{
125
- TCGLabelQemuLdst *label = new_ldst_label(s);
126
-
127
- label->is_ld = is_ld;
128
- label->oi = oi;
129
- label->type = type;
130
- label->datalo_reg = datalo;
131
- label->datahi_reg = datahi;
132
- label->addrlo_reg = addrlo;
133
- label->addrhi_reg = addrhi;
134
- label->raddr = tcg_splitwx_to_rx(raddr);
135
- label->label_ptr[0] = label_ptr;
136
-}
137
-
138
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
139
{
140
TCGReg argreg;
141
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
142
return true;
143
}
144
#else
145
-
146
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
147
- TCGReg addrhi, unsigned a_bits)
148
-{
149
- unsigned a_mask = (1 << a_bits) - 1;
150
- TCGLabelQemuLdst *label = new_ldst_label(s);
151
-
152
- label->is_ld = is_ld;
153
- label->addrlo_reg = addrlo;
154
- label->addrhi_reg = addrhi;
155
-
156
- /* We are expecting a_bits to max out at 7, and can easily support 8. */
157
- tcg_debug_assert(a_mask <= 0xff);
158
- /* tst addr, #mask */
159
- tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
160
-
161
- /* blne slow_path */
162
- label->label_ptr[0] = s->code_ptr;
163
- tcg_out_bl_imm(s, COND_NE, 0);
164
-
165
- label->raddr = tcg_splitwx_to_rx(s->code_ptr);
166
-}
167
-
168
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
169
{
170
if (!reloc_pc24(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
171
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
172
}
173
#endif /* SOFTMMU */
174
175
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
176
+ TCGReg addrlo, TCGReg addrhi,
177
+ MemOpIdx oi, bool is_ld)
178
+{
38
+{
179
+ TCGLabelQemuLdst *ldst = NULL;
39
+ uintptr_t pi = (uintptr_t)pv;
180
+ MemOp opc = get_memop(oi);
40
+ __int128_t *ptr_align = (__int128_t *)(pi & ~7);
181
+ MemOp a_bits = get_alignment_bits(opc);
41
+ int shr = (pi & 7) * 8;
182
+ unsigned a_mask = (1 << a_bits) - 1;
42
+ Int128Alias r;
183
+
184
+#ifdef CONFIG_SOFTMMU
185
+ int mem_index = get_mmuidx(oi);
186
+ int cmp_off = is_ld ? offsetof(CPUTLBEntry, addr_read)
187
+ : offsetof(CPUTLBEntry, addr_write);
188
+ int fast_off = TLB_MASK_TABLE_OFS(mem_index);
189
+ unsigned s_mask = (1 << (opc & MO_SIZE)) - 1;
190
+ TCGReg t_addr;
191
+
192
+ ldst = new_ldst_label(s);
193
+ ldst->is_ld = is_ld;
194
+ ldst->oi = oi;
195
+ ldst->addrlo_reg = addrlo;
196
+ ldst->addrhi_reg = addrhi;
197
+
198
+ /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {r0,r1}. */
199
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
200
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -256);
201
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
202
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 4);
203
+ tcg_out_ldrd_8(s, COND_AL, TCG_REG_R0, TCG_AREG0, fast_off);
204
+
205
+ /* Extract the tlb index from the address into R0. */
206
+ tcg_out_dat_reg(s, COND_AL, ARITH_AND, TCG_REG_R0, TCG_REG_R0, addrlo,
207
+ SHIFT_IMM_LSR(TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS));
208
+
43
+
209
+ /*
44
+ /*
210
+ * Add the tlb_table pointer, creating the CPUTLBEntry address in R1.
45
+ * ptr_align % 16 is now only 0 or 8.
211
+ * Load the tlb comparator into R2/R3 and the fast path addend into R1.
46
+ * If the host supports atomic loads with VMOVDQU, then always use that,
47
+ * making the branch highly predictable. Otherwise we must use VMOVDQA
48
+ * when ptr_align % 16 == 0 for 16-byte atomicity.
212
+ */
49
+ */
213
+ if (cmp_off == 0) {
50
+ if ((cpuinfo & CPUINFO_ATOMIC_VMOVDQU) || (pi & 8)) {
214
+ if (TARGET_LONG_BITS == 64) {
51
+ asm("vmovdqu %1, %0" : "=x" (r.i) : "m" (*ptr_align));
215
+ tcg_out_ldrd_rwb(s, COND_AL, TCG_REG_R2, TCG_REG_R1, TCG_REG_R0);
216
+ } else {
217
+ tcg_out_ld32_rwb(s, COND_AL, TCG_REG_R2, TCG_REG_R1, TCG_REG_R0);
218
+ }
219
+ } else {
52
+ } else {
220
+ tcg_out_dat_reg(s, COND_AL, ARITH_ADD,
53
+ asm("vmovdqa %1, %0" : "=x" (r.i) : "m" (*ptr_align));
221
+ TCG_REG_R1, TCG_REG_R1, TCG_REG_R0, 0);
222
+ if (TARGET_LONG_BITS == 64) {
223
+ tcg_out_ldrd_8(s, COND_AL, TCG_REG_R2, TCG_REG_R1, cmp_off);
224
+ } else {
225
+ tcg_out_ld32_12(s, COND_AL, TCG_REG_R2, TCG_REG_R1, cmp_off);
226
+ }
227
+ }
54
+ }
228
+
55
+ return int128_getlo(int128_urshift(r.s, shr));
229
+ /* Load the tlb addend. */
56
+}
230
+ tcg_out_ld32_12(s, COND_AL, TCG_REG_R1, TCG_REG_R1,
231
+ offsetof(CPUTLBEntry, addend));
232
+
233
+ /*
234
+ * Check alignment, check comparators.
235
+ * Do this in 2-4 insns. Use MOVW for v7, if possible,
236
+ * to reduce the number of sequential conditional instructions.
237
+ * Almost all guests have at least 4k pages, which means that we need
238
+ * to clear at least 9 bits even for an 8-byte memory, which means it
239
+ * isn't worth checking for an immediate operand for BIC.
240
+ *
241
+ * For unaligned accesses, test the page of the last unit of alignment.
242
+ * This leaves the least significant alignment bits unchanged, and of
243
+ * course must be zero.
244
+ */
245
+ t_addr = addrlo;
246
+ if (a_mask < s_mask) {
247
+ t_addr = TCG_REG_R0;
248
+ tcg_out_dat_imm(s, COND_AL, ARITH_ADD, t_addr,
249
+ addrlo, s_mask - a_mask);
250
+ }
251
+ if (use_armv7_instructions && TARGET_PAGE_BITS <= 16) {
252
+ tcg_out_movi32(s, COND_AL, TCG_REG_TMP, ~(TARGET_PAGE_MASK | a_mask));
253
+ tcg_out_dat_reg(s, COND_AL, ARITH_BIC, TCG_REG_TMP,
254
+ t_addr, TCG_REG_TMP, 0);
255
+ tcg_out_dat_reg(s, COND_AL, ARITH_CMP, 0, TCG_REG_R2, TCG_REG_TMP, 0);
256
+ } else {
257
+ if (a_mask) {
258
+ tcg_debug_assert(a_mask <= 0xff);
259
+ tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
260
+ }
261
+ tcg_out_dat_reg(s, COND_AL, ARITH_MOV, TCG_REG_TMP, 0, t_addr,
262
+ SHIFT_IMM_LSR(TARGET_PAGE_BITS));
263
+ tcg_out_dat_reg(s, (a_mask ? COND_EQ : COND_AL), ARITH_CMP,
264
+ 0, TCG_REG_R2, TCG_REG_TMP,
265
+ SHIFT_IMM_LSL(TARGET_PAGE_BITS));
266
+ }
267
+
268
+ if (TARGET_LONG_BITS == 64) {
269
+ tcg_out_dat_reg(s, COND_EQ, ARITH_CMP, 0, TCG_REG_R3, addrhi, 0);
270
+ }
271
+
272
+ *h = (HostAddress){
273
+ .cond = COND_AL,
274
+ .base = addrlo,
275
+ .index = TCG_REG_R1,
276
+ .index_scratch = true,
277
+ };
278
+#else
57
+#else
279
+ if (a_mask) {
58
+/* Fallback definition that must be optimized away, or error. */
280
+ ldst = new_ldst_label(s);
59
+uint64_t QEMU_ERROR("unsupported atomic")
281
+ ldst->is_ld = is_ld;
60
+ load_atom_extract_al16_or_al8(void *pv, int s);
282
+ ldst->oi = oi;
283
+ ldst->addrlo_reg = addrlo;
284
+ ldst->addrhi_reg = addrhi;
285
+
286
+ /* We are expecting a_bits to max out at 7 */
287
+ tcg_debug_assert(a_mask <= 0xff);
288
+ /* tst addr, #mask */
289
+ tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
290
+ }
291
+
292
+ *h = (HostAddress){
293
+ .cond = COND_AL,
294
+ .base = addrlo,
295
+ .index = guest_base ? TCG_REG_GUEST_BASE : -1,
296
+ .index_scratch = false,
297
+ };
298
+#endif
61
+#endif
299
+
62
+
300
+ return ldst;
63
+#endif /* X86_64_LOAD_EXTRACT_AL16_AL8_H */
301
+}
302
+
303
static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg datalo,
304
TCGReg datahi, HostAddress h)
305
{
306
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
307
MemOpIdx oi, TCGType data_type)
308
{
309
MemOp opc = get_memop(oi);
310
+ TCGLabelQemuLdst *ldst;
311
HostAddress h;
312
313
-#ifdef CONFIG_SOFTMMU
314
- h.cond = COND_AL;
315
- h.base = addrlo;
316
- h.index_scratch = true;
317
- h.index = tcg_out_tlb_read(s, addrlo, addrhi, opc, get_mmuidx(oi), 1);
318
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, true);
319
+ if (ldst) {
320
+ ldst->type = data_type;
321
+ ldst->datalo_reg = datalo;
322
+ ldst->datahi_reg = datahi;
323
324
- /*
325
- * This a conditional BL only to load a pointer within this opcode into
326
- * LR for the slow path. We will not be using the value for a tail call.
327
- */
328
- tcg_insn_unit *label_ptr = s->code_ptr;
329
- tcg_out_bl_imm(s, COND_NE, 0);
330
+ /*
331
+ * This is a conditional BL only to load a pointer within this
332
+ * opcode into LR for the slow path. We will not be using
333
+ * the value for a tail call.
334
+ */
335
+ ldst->label_ptr[0] = s->code_ptr;
336
+ tcg_out_bl_imm(s, COND_NE, 0);
337
338
- tcg_out_qemu_ld_direct(s, opc, datalo, datahi, h);
339
-
340
- add_qemu_ldst_label(s, true, oi, data_type, datalo, datahi,
341
- addrlo, addrhi, s->code_ptr, label_ptr);
342
-#else
343
- unsigned a_bits = get_alignment_bits(opc);
344
- if (a_bits) {
345
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
346
+ tcg_out_qemu_ld_direct(s, opc, datalo, datahi, h);
347
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
348
+ } else {
349
+ tcg_out_qemu_ld_direct(s, opc, datalo, datahi, h);
350
}
351
-
352
- h.cond = COND_AL;
353
- h.base = addrlo;
354
- h.index = guest_base ? TCG_REG_GUEST_BASE : -1;
355
- h.index_scratch = false;
356
- tcg_out_qemu_ld_direct(s, opc, datalo, datahi, h);
357
-#endif
358
}
359
360
static void tcg_out_qemu_st_direct(TCGContext *s, MemOp opc, TCGReg datalo,
361
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
362
MemOpIdx oi, TCGType data_type)
363
{
364
MemOp opc = get_memop(oi);
365
+ TCGLabelQemuLdst *ldst;
366
HostAddress h;
367
368
-#ifdef CONFIG_SOFTMMU
369
- h.cond = COND_EQ;
370
- h.base = addrlo;
371
- h.index_scratch = true;
372
- h.index = tcg_out_tlb_read(s, addrlo, addrhi, opc, get_mmuidx(oi), 0);
373
- tcg_out_qemu_st_direct(s, opc, datalo, datahi, h);
374
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, false);
375
+ if (ldst) {
376
+ ldst->type = data_type;
377
+ ldst->datalo_reg = datalo;
378
+ ldst->datahi_reg = datahi;
379
380
- /* The conditional call must come last, as we're going to return here. */
381
- tcg_insn_unit *label_ptr = s->code_ptr;
382
- tcg_out_bl_imm(s, COND_NE, 0);
383
-
384
- add_qemu_ldst_label(s, false, oi, data_type, datalo, datahi,
385
- addrlo, addrhi, s->code_ptr, label_ptr);
386
-#else
387
- unsigned a_bits = get_alignment_bits(opc);
388
-
389
- h.cond = COND_AL;
390
- if (a_bits) {
391
- tcg_out_test_alignment(s, false, addrlo, addrhi, a_bits);
392
h.cond = COND_EQ;
393
- }
394
+ tcg_out_qemu_st_direct(s, opc, datalo, datahi, h);
395
396
- h.base = addrlo;
397
- h.index = guest_base ? TCG_REG_GUEST_BASE : -1;
398
- h.index_scratch = false;
399
- tcg_out_qemu_st_direct(s, opc, datalo, datahi, h);
400
-#endif
401
+ /* The conditional call is last, as we're going to return here. */
402
+ ldst->label_ptr[0] = s->code_ptr;
403
+ tcg_out_bl_imm(s, COND_NE, 0);
404
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
405
+ } else {
406
+ tcg_out_qemu_st_direct(s, opc, datalo, datahi, h);
407
+ }
408
}
409
410
static void tcg_out_epilogue(TCGContext *s);
411
--
64
--
412
2.34.1
65
2.34.1
413
414
1
Like cpu_in_exclusive_context, but also true if
2
there is no other cpu against which we could race.
3
4
Use it in tb_flush as a direct replacement.
5
Use it in cpu_loop_exit_atomic to ensure that there
6
is no loop against cpu_exec_step_atomic.
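
For illustration, a stand-alone sketch of the predicate (CPUStateSketch and the CF_PARALLEL value below are placeholders for this sketch; the real helper tests cs->tcg_cflags and cpu_in_exclusive_context(), exactly as in the internal.h hunk below):

    #include <stdbool.h>

    #define CF_PARALLEL 0x0008u   /* illustrative bit; not QEMU's actual value */

    typedef struct {
        unsigned tcg_cflags;      /* CF_PARALLEL set when other vCPUs may run */
        bool exclusive;           /* stand-in for cpu_in_exclusive_context() */
    } CPUStateSketch;

    /*
     * True when nothing can race with this vCPU: either it was started
     * without CF_PARALLEL, or it currently holds the exclusive section.
     */
    static bool cpu_in_serial_context_sketch(const CPUStateSketch *cs)
    {
        return !(cs->tcg_cflags & CF_PARALLEL) || cs->exclusive;
    }

With that guarantee, tb_flush can call do_tb_flush directly instead of scheduling it with async_safe_run_on_cpu, and cpu_loop_exit_atomic can assert that it never restarts an already-serial execution, as the tb-maint.c and cpu-exec-common.c hunks show.
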
7
8
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
9
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
10
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
11
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
12
---
3
---
13
accel/tcg/internal.h | 9 +++++++++
4
.../aarch64/host/load-extract-al16-al8.h | 40 +++++++++++++++++++
14
accel/tcg/cpu-exec-common.c | 3 +++
5
1 file changed, 40 insertions(+)
15
accel/tcg/tb-maint.c | 2 +-
6
create mode 100644 host/include/aarch64/host/load-extract-al16-al8.h
16
3 files changed, 13 insertions(+), 1 deletion(-)
17
7
18
diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
8
diff --git a/host/include/aarch64/host/load-extract-al16-al8.h b/host/include/aarch64/host/load-extract-al16-al8.h
19
index XXXXXXX..XXXXXXX 100644
9
new file mode 100644
20
--- a/accel/tcg/internal.h
10
index XXXXXXX..XXXXXXX
21
+++ b/accel/tcg/internal.h
11
--- /dev/null
22
@@ -XXX,XX +XXX,XX @@ static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
12
+++ b/host/include/aarch64/host/load-extract-al16-al8.h
23
}
13
@@ -XXX,XX +XXX,XX @@
24
}
25
26
+/*
14
+/*
27
+ * Return true if CS is not running in parallel with other cpus, either
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
28
+ * because there are no other cpus or we are within an exclusive context.
16
+ * Atomic extract 64 from 128-bit, AArch64 version.
17
+ *
18
+ * Copyright (C) 2023 Linaro, Ltd.
29
+ */
19
+ */
30
+static inline bool cpu_in_serial_context(CPUState *cs)
20
+
21
+#ifndef AARCH64_LOAD_EXTRACT_AL16_AL8_H
22
+#define AARCH64_LOAD_EXTRACT_AL16_AL8_H
23
+
24
+#include "host/cpuinfo.h"
25
+#include "tcg/debug-assert.h"
26
+
27
+/**
28
+ * load_atom_extract_al16_or_al8:
29
+ * @pv: host address
30
+ * @s: object size in bytes, @s <= 8.
31
+ *
32
+ * Load @s bytes from @pv, when pv % s != 0. If [p, p+s-1] does not
33
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
34
+ * otherwise the access must be 8-byte atomic.
35
+ */
36
+static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
31
+{
37
+{
32
+ return !(cs->tcg_cflags & CF_PARALLEL) || cpu_in_exclusive_context(cs);
38
+ uintptr_t pi = (uintptr_t)pv;
39
+ __int128_t *ptr_align = (__int128_t *)(pi & ~7);
40
+ int shr = (pi & 7) * 8;
41
+ uint64_t l, h;
42
+
43
+ /*
44
+ * With FEAT_LSE2, LDP is single-copy atomic if 16-byte aligned
45
+ * and single-copy atomic on the parts if 8-byte aligned.
46
+ * All we need do is align the pointer mod 8.
47
+ */
48
+ tcg_debug_assert(HAVE_ATOMIC128_RO);
49
+ asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*ptr_align));
50
+ return (l >> shr) | (h << (-shr & 63));
33
+}
51
+}
34
+
52
+
35
extern int64_t max_delay;
53
+#endif /* AARCH64_LOAD_EXTRACT_AL16_AL8_H */
36
extern int64_t max_advance;
37
38
diff --git a/accel/tcg/cpu-exec-common.c b/accel/tcg/cpu-exec-common.c
39
index XXXXXXX..XXXXXXX 100644
40
--- a/accel/tcg/cpu-exec-common.c
41
+++ b/accel/tcg/cpu-exec-common.c
42
@@ -XXX,XX +XXX,XX @@
43
#include "sysemu/tcg.h"
44
#include "exec/exec-all.h"
45
#include "qemu/plugin.h"
46
+#include "internal.h"
47
48
bool tcg_allowed;
49
50
@@ -XXX,XX +XXX,XX @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
51
52
void cpu_loop_exit_atomic(CPUState *cpu, uintptr_t pc)
53
{
54
+ /* Prevent looping if already executing in a serial context. */
55
+ g_assert(!cpu_in_serial_context(cpu));
56
cpu->exception_index = EXCP_ATOMIC;
57
cpu_loop_exit_restore(cpu, pc);
58
}
59
diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
60
index XXXXXXX..XXXXXXX 100644
61
--- a/accel/tcg/tb-maint.c
62
+++ b/accel/tcg/tb-maint.c
63
@@ -XXX,XX +XXX,XX @@ void tb_flush(CPUState *cpu)
64
if (tcg_enabled()) {
65
unsigned tb_flush_count = qatomic_read(&tb_ctx.tb_flush_count);
66
67
- if (cpu_in_exclusive_context(cpu)) {
68
+ if (cpu_in_serial_context(cpu)) {
69
do_tb_flush(cpu, RUN_ON_CPU_HOST_INT(tb_flush_count));
70
} else {
71
async_safe_run_on_cpu(cpu, do_tb_flush,
72
--
54
--
73
2.34.1
55
2.34.1
74
75
1
From: Thomas Huth <thuth@redhat.com>
1
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
2
3
We'd like to move disas.c into the common code source set, where
4
CONFIG_USER_ONLY is not available anymore. So we have to move
5
the related code into a separate file instead.
6
7
Signed-off-by: Thomas Huth <thuth@redhat.com>
8
Message-Id: <20230508133745.109463-2-thuth@redhat.com>
9
[rth: Type change done in a separate patch]
10
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
11
---
3
---
12
disas/disas-internal.h | 21 ++++++++++++
4
host/include/aarch64/host/store-insert-al16.h | 47 +++++++++++++++++++
13
disas/disas-mon.c | 65 ++++++++++++++++++++++++++++++++++++
5
1 file changed, 47 insertions(+)
14
disas/disas.c | 76 ++++--------------------------------------
6
create mode 100644 host/include/aarch64/host/store-insert-al16.h
15
disas/meson.build | 1 +
16
4 files changed, 93 insertions(+), 70 deletions(-)
17
create mode 100644 disas/disas-internal.h
18
create mode 100644 disas/disas-mon.c
19
7
20
diff --git a/disas/disas-internal.h b/disas/disas-internal.h
8
diff --git a/host/include/aarch64/host/store-insert-al16.h b/host/include/aarch64/host/store-insert-al16.h
21
new file mode 100644
9
new file mode 100644
22
index XXXXXXX..XXXXXXX
10
index XXXXXXX..XXXXXXX
23
--- /dev/null
11
--- /dev/null
24
+++ b/disas/disas-internal.h
12
+++ b/host/include/aarch64/host/store-insert-al16.h
25
@@ -XXX,XX +XXX,XX @@
13
@@ -XXX,XX +XXX,XX @@
26
+/*
14
+/*
27
+ * Definitions used internally in the disassembly code
15
+ * SPDX-License-Identifier: GPL-2.0-or-later
16
+ * Atomic store insert into 128-bit, AArch64 version.
28
+ *
17
+ *
29
+ * SPDX-License-Identifier: GPL-2.0-or-later
18
+ * Copyright (C) 2023 Linaro, Ltd.
30
+ */
19
+ */
31
+
20
+
32
+#ifndef DISAS_INTERNAL_H
21
+#ifndef AARCH64_STORE_INSERT_AL16_H
33
+#define DISAS_INTERNAL_H
22
+#define AARCH64_STORE_INSERT_AL16_H
34
+
23
+
35
+#include "disas/dis-asm.h"
24
+/**
25
+ * store_atom_insert_al16:
26
+ * @p: host address
27
+ * @val: shifted value to store
28
+ * @msk: mask for value to store
29
+ *
30
+ * Atomically store @val to @p masked by @msk.
31
+ */
32
+static inline void ATTRIBUTE_ATOMIC128_OPT
33
+store_atom_insert_al16(Int128 *ps, Int128 val, Int128 msk)
34
+{
35
+ /*
36
+ * GCC only implements __sync* primitives for int128 on aarch64.
37
+ * We can do better without the barriers, and integrating the
38
+ * arithmetic into the load-exclusive/store-conditional pair.
39
+ */
40
+ uint64_t tl, th, vl, vh, ml, mh;
41
+ uint32_t fail;
36
+
42
+
37
+typedef struct CPUDebug {
43
+ qemu_build_assert(!HOST_BIG_ENDIAN);
38
+ struct disassemble_info info;
44
+ vl = int128_getlo(val);
39
+ CPUState *cpu;
45
+ vh = int128_gethi(val);
40
+} CPUDebug;
46
+ ml = int128_getlo(msk);
47
+ mh = int128_gethi(msk);
41
+
48
+
42
+void disas_initialize_debug_target(CPUDebug *s, CPUState *cpu);
49
+ asm("0: ldxp %[l], %[h], %[mem]\n\t"
43
+int disas_gstring_printf(FILE *stream, const char *fmt, ...)
50
+ "bic %[l], %[l], %[ml]\n\t"
44
+ G_GNUC_PRINTF(2, 3);
51
+ "bic %[h], %[h], %[mh]\n\t"
45
+
52
+ "orr %[l], %[l], %[vl]\n\t"
46
+#endif
53
+ "orr %[h], %[h], %[vh]\n\t"
47
diff --git a/disas/disas-mon.c b/disas/disas-mon.c
54
+ "stxp %w[f], %[l], %[h], %[mem]\n\t"
48
new file mode 100644
55
+ "cbnz %w[f], 0b\n"
49
index XXXXXXX..XXXXXXX
56
+ : [mem] "+Q"(*ps), [f] "=&r"(fail), [l] "=&r"(tl), [h] "=&r"(th)
50
--- /dev/null
57
+ : [vl] "r"(vl), [vh] "r"(vh), [ml] "r"(ml), [mh] "r"(mh));
51
+++ b/disas/disas-mon.c
52
@@ -XXX,XX +XXX,XX @@
53
+/*
54
+ * Functions related to disassembly from the monitor
55
+ *
56
+ * SPDX-License-Identifier: GPL-2.0-or-later
57
+ */
58
+
59
+#include "qemu/osdep.h"
60
+#include "disas-internal.h"
61
+#include "disas/disas.h"
62
+#include "exec/memory.h"
63
+#include "hw/core/cpu.h"
64
+#include "monitor/monitor.h"
65
+
66
+static int
67
+physical_read_memory(bfd_vma memaddr, bfd_byte *myaddr, int length,
68
+ struct disassemble_info *info)
69
+{
70
+ CPUDebug *s = container_of(info, CPUDebug, info);
71
+ MemTxResult res;
72
+
73
+ res = address_space_read(s->cpu->as, memaddr, MEMTXATTRS_UNSPECIFIED,
74
+ myaddr, length);
75
+ return res == MEMTX_OK ? 0 : EIO;
76
+}
58
+}
77
+
59
+
78
+/* Disassembler for the monitor. */
60
+#endif /* AARCH64_STORE_INSERT_AL16_H */
79
+void monitor_disas(Monitor *mon, CPUState *cpu, uint64_t pc,
80
+ int nb_insn, bool is_physical)
81
+{
82
+ int count, i;
83
+ CPUDebug s;
84
+ g_autoptr(GString) ds = g_string_new("");
85
+
86
+ disas_initialize_debug_target(&s, cpu);
87
+ s.info.fprintf_func = disas_gstring_printf;
88
+ s.info.stream = (FILE *)ds; /* abuse this slot */
89
+
90
+ if (is_physical) {
91
+ s.info.read_memory_func = physical_read_memory;
92
+ }
93
+ s.info.buffer_vma = pc;
94
+
95
+ if (s.info.cap_arch >= 0 && cap_disas_monitor(&s.info, pc, nb_insn)) {
96
+ monitor_puts(mon, ds->str);
97
+ return;
98
+ }
99
+
100
+ if (!s.info.print_insn) {
101
+ monitor_printf(mon, "0x%08" PRIx64
102
+ ": Asm output not supported on this arch\n", pc);
103
+ return;
104
+ }
105
+
106
+ for (i = 0; i < nb_insn; i++) {
107
+ g_string_append_printf(ds, "0x%08" PRIx64 ": ", pc);
108
+ count = s.info.print_insn(pc, &s.info);
109
+ g_string_append_c(ds, '\n');
110
+ if (count < 0) {
111
+ break;
112
+ }
113
+ pc += count;
114
+ }
115
+
116
+ monitor_puts(mon, ds->str);
117
+}
118
diff --git a/disas/disas.c b/disas/disas.c
119
index XXXXXXX..XXXXXXX 100644
120
--- a/disas/disas.c
121
+++ b/disas/disas.c
122
@@ -XXX,XX +XXX,XX @@
123
/* General "disassemble this chunk" code. Used for debugging. */
124
#include "qemu/osdep.h"
125
-#include "disas/dis-asm.h"
126
+#include "disas/disas-internal.h"
127
#include "elf.h"
128
#include "qemu/qemu-print.h"
129
#include "disas/disas.h"
130
@@ -XXX,XX +XXX,XX @@
131
#include "hw/core/cpu.h"
132
#include "exec/memory.h"
133
134
-typedef struct CPUDebug {
135
- struct disassemble_info info;
136
- CPUState *cpu;
137
-} CPUDebug;
138
-
139
/* Filled in by elfload.c. Simplistic, but will do for now. */
140
struct syminfo *syminfos = NULL;
141
142
@@ -XXX,XX +XXX,XX @@ static void initialize_debug(CPUDebug *s)
143
s->info.symbol_at_address_func = symbol_at_address;
144
}
145
146
-static void initialize_debug_target(CPUDebug *s, CPUState *cpu)
147
+void disas_initialize_debug_target(CPUDebug *s, CPUState *cpu)
148
{
149
initialize_debug(s);
150
151
@@ -XXX,XX +XXX,XX @@ void target_disas(FILE *out, CPUState *cpu, uint64_t code, size_t size)
152
int count;
153
CPUDebug s;
154
155
- initialize_debug_target(&s, cpu);
156
+ disas_initialize_debug_target(&s, cpu);
157
s.info.fprintf_func = fprintf;
158
s.info.stream = out;
159
s.info.buffer_vma = code;
160
@@ -XXX,XX +XXX,XX @@ void target_disas(FILE *out, CPUState *cpu, uint64_t code, size_t size)
161
}
162
}
163
164
-static int G_GNUC_PRINTF(2, 3)
165
-gstring_printf(FILE *stream, const char *fmt, ...)
166
+int disas_gstring_printf(FILE *stream, const char *fmt, ...)
167
{
168
/* We abuse the FILE parameter to pass a GString. */
169
GString *s = (GString *)stream;
170
@@ -XXX,XX +XXX,XX @@ char *plugin_disas(CPUState *cpu, uint64_t addr, size_t size)
171
CPUDebug s;
172
GString *ds = g_string_new(NULL);
173
174
- initialize_debug_target(&s, cpu);
175
- s.info.fprintf_func = gstring_printf;
176
+ disas_initialize_debug_target(&s, cpu);
177
+ s.info.fprintf_func = disas_gstring_printf;
178
s.info.stream = (FILE *)ds; /* abuse this slot */
179
s.info.buffer_vma = addr;
180
s.info.buffer_length = size;
181
@@ -XXX,XX +XXX,XX @@ const char *lookup_symbol(uint64_t orig_addr)
182
183
return symbol;
184
}
185
-
186
-#if !defined(CONFIG_USER_ONLY)
187
-
188
-#include "monitor/monitor.h"
189
-
190
-static int
191
-physical_read_memory(bfd_vma memaddr, bfd_byte *myaddr, int length,
192
- struct disassemble_info *info)
193
-{
194
- CPUDebug *s = container_of(info, CPUDebug, info);
195
- MemTxResult res;
196
-
197
- res = address_space_read(s->cpu->as, memaddr, MEMTXATTRS_UNSPECIFIED,
198
- myaddr, length);
199
- return res == MEMTX_OK ? 0 : EIO;
200
-}
201
-
202
-/* Disassembler for the monitor. */
203
-void monitor_disas(Monitor *mon, CPUState *cpu, uint64_t pc,
204
- int nb_insn, bool is_physical)
205
-{
206
- int count, i;
207
- CPUDebug s;
208
- g_autoptr(GString) ds = g_string_new("");
209
-
210
- initialize_debug_target(&s, cpu);
211
- s.info.fprintf_func = gstring_printf;
212
- s.info.stream = (FILE *)ds; /* abuse this slot */
213
-
214
- if (is_physical) {
215
- s.info.read_memory_func = physical_read_memory;
216
- }
217
- s.info.buffer_vma = pc;
218
-
219
- if (s.info.cap_arch >= 0 && cap_disas_monitor(&s.info, pc, nb_insn)) {
220
- monitor_puts(mon, ds->str);
221
- return;
222
- }
223
-
224
- if (!s.info.print_insn) {
225
- monitor_printf(mon, "0x%08" PRIx64
226
- ": Asm output not supported on this arch\n", pc);
227
- return;
228
- }
229
-
230
- for (i = 0; i < nb_insn; i++) {
231
- g_string_append_printf(ds, "0x%08" PRIx64 ": ", pc);
232
- count = s.info.print_insn(pc, &s.info);
233
- g_string_append_c(ds, '\n');
234
- if (count < 0) {
235
- break;
236
- }
237
- pc += count;
238
- }
239
-
240
- monitor_puts(mon, ds->str);
241
-}
242
-#endif
243
diff --git a/disas/meson.build b/disas/meson.build
244
index XXXXXXX..XXXXXXX 100644
245
--- a/disas/meson.build
246
+++ b/disas/meson.build
247
@@ -XXX,XX +XXX,XX @@ common_ss.add(when: 'CONFIG_SPARC_DIS', if_true: files('sparc.c'))
248
common_ss.add(when: 'CONFIG_XTENSA_DIS', if_true: files('xtensa.c'))
249
common_ss.add(when: capstone, if_true: [files('capstone.c'), capstone])
250
251
+softmmu_ss.add(files('disas-mon.c'))
252
specific_ss.add(files('disas.c'), capstone)
253
--
61
--
254
2.34.1
62
2.34.1
1
While performing the load in the delay slot of the call to the common
1
The last use was removed by e77c89fb086a.
2
bswap helper function is cute, it is not worth the added complexity.
3
2
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
3
Fixes: e77c89fb086a ("cputlb: Remove static tlb sizing")
4
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
6
---
7
tcg/mips/tcg-target.h | 4 +-
7
tcg/aarch64/tcg-target.h | 1 -
8
tcg/mips/tcg-target.c.inc | 284 ++++++--------------------------------
8
tcg/arm/tcg-target.h | 1 -
9
2 files changed, 48 insertions(+), 240 deletions(-)
9
tcg/i386/tcg-target.h | 1 -
10
tcg/mips/tcg-target.h | 1 -
11
tcg/ppc/tcg-target.h | 1 -
12
tcg/riscv/tcg-target.h | 1 -
13
tcg/s390x/tcg-target.h | 1 -
14
tcg/sparc64/tcg-target.h | 1 -
15
tcg/tci/tcg-target.h | 1 -
16
9 files changed, 9 deletions(-)
10
17
18
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
19
index XXXXXXX..XXXXXXX 100644
20
--- a/tcg/aarch64/tcg-target.h
21
+++ b/tcg/aarch64/tcg-target.h
22
@@ -XXX,XX +XXX,XX @@
23
#include "host/cpuinfo.h"
24
25
#define TCG_TARGET_INSN_UNIT_SIZE 4
26
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
27
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
28
29
typedef enum {
30
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
31
index XXXXXXX..XXXXXXX 100644
32
--- a/tcg/arm/tcg-target.h
33
+++ b/tcg/arm/tcg-target.h
34
@@ -XXX,XX +XXX,XX @@ extern int arm_arch;
35
#define use_armv7_instructions (__ARM_ARCH >= 7 || arm_arch >= 7)
36
37
#define TCG_TARGET_INSN_UNIT_SIZE 4
38
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
39
#define MAX_CODE_GEN_BUFFER_SIZE UINT32_MAX
40
41
typedef enum {
42
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
43
index XXXXXXX..XXXXXXX 100644
44
--- a/tcg/i386/tcg-target.h
45
+++ b/tcg/i386/tcg-target.h
46
@@ -XXX,XX +XXX,XX @@
47
#include "host/cpuinfo.h"
48
49
#define TCG_TARGET_INSN_UNIT_SIZE 1
50
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
51
52
#ifdef __x86_64__
53
# define TCG_TARGET_REG_BITS 64
11
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
54
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
12
index XXXXXXX..XXXXXXX 100644
55
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/mips/tcg-target.h
56
--- a/tcg/mips/tcg-target.h
14
+++ b/tcg/mips/tcg-target.h
57
+++ b/tcg/mips/tcg-target.h
15
@@ -XXX,XX +XXX,XX @@ extern bool use_mips32r2_instructions;
58
@@ -XXX,XX +XXX,XX @@
16
#define TCG_TARGET_HAS_ext16u_i64 0 /* andi rt, rs, 0xffff */
17
#endif
59
#endif
18
60
19
-#define TCG_TARGET_DEFAULT_MO (0)
61
#define TCG_TARGET_INSN_UNIT_SIZE 4
20
-#define TCG_TARGET_HAS_MEMORY_BSWAP 1
62
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
21
+#define TCG_TARGET_DEFAULT_MO 0
63
#define TCG_TARGET_NB_REGS 32
22
+#define TCG_TARGET_HAS_MEMORY_BSWAP 0
64
23
65
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
24
#define TCG_TARGET_NEED_LDST_LABELS
66
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
25
26
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
27
index XXXXXXX..XXXXXXX 100644
67
index XXXXXXX..XXXXXXX 100644
28
--- a/tcg/mips/tcg-target.c.inc
68
--- a/tcg/ppc/tcg-target.h
29
+++ b/tcg/mips/tcg-target.c.inc
69
+++ b/tcg/ppc/tcg-target.h
30
@@ -XXX,XX +XXX,XX @@ static void tcg_out_call(TCGContext *s, const tcg_insn_unit *arg,
70
@@ -XXX,XX +XXX,XX @@
31
}
71
32
72
#define TCG_TARGET_NB_REGS 64
33
#if defined(CONFIG_SOFTMMU)
73
#define TCG_TARGET_INSN_UNIT_SIZE 4
34
-static void * const qemu_ld_helpers[(MO_SSIZE | MO_BSWAP) + 1] = {
74
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
35
+static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
75
36
[MO_UB] = helper_ret_ldub_mmu,
76
typedef enum {
37
[MO_SB] = helper_ret_ldsb_mmu,
77
TCG_REG_R0, TCG_REG_R1, TCG_REG_R2, TCG_REG_R3,
38
- [MO_LEUW] = helper_le_lduw_mmu,
78
diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
39
- [MO_LESW] = helper_le_ldsw_mmu,
79
index XXXXXXX..XXXXXXX 100644
40
- [MO_LEUL] = helper_le_ldul_mmu,
80
--- a/tcg/riscv/tcg-target.h
41
- [MO_LEUQ] = helper_le_ldq_mmu,
81
+++ b/tcg/riscv/tcg-target.h
42
- [MO_BEUW] = helper_be_lduw_mmu,
82
@@ -XXX,XX +XXX,XX @@
43
- [MO_BESW] = helper_be_ldsw_mmu,
83
#define TCG_TARGET_REG_BITS 64
44
- [MO_BEUL] = helper_be_ldul_mmu,
84
45
- [MO_BEUQ] = helper_be_ldq_mmu,
85
#define TCG_TARGET_INSN_UNIT_SIZE 4
46
-#if TCG_TARGET_REG_BITS == 64
86
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 20
47
- [MO_LESL] = helper_le_ldsl_mmu,
87
#define TCG_TARGET_NB_REGS 32
48
- [MO_BESL] = helper_be_ldsl_mmu,
88
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
49
+#if HOST_BIG_ENDIAN
89
50
+ [MO_UW] = helper_be_lduw_mmu,
90
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
51
+ [MO_SW] = helper_be_ldsw_mmu,
91
index XXXXXXX..XXXXXXX 100644
52
+ [MO_UL] = helper_be_ldul_mmu,
92
--- a/tcg/s390x/tcg-target.h
53
+ [MO_SL] = helper_be_ldsl_mmu,
93
+++ b/tcg/s390x/tcg-target.h
54
+ [MO_UQ] = helper_be_ldq_mmu,
94
@@ -XXX,XX +XXX,XX @@
55
+#else
95
#define S390_TCG_TARGET_H
56
+ [MO_UW] = helper_le_lduw_mmu,
96
57
+ [MO_SW] = helper_le_ldsw_mmu,
97
#define TCG_TARGET_INSN_UNIT_SIZE 2
58
+ [MO_UL] = helper_le_ldul_mmu,
98
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
59
+ [MO_UQ] = helper_le_ldq_mmu,
99
60
+ [MO_SL] = helper_le_ldsl_mmu,
100
/* We have a +- 4GB range on the branches; leave some slop. */
61
#endif
101
#define MAX_CODE_GEN_BUFFER_SIZE (3 * GiB)
62
};
102
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
63
103
index XXXXXXX..XXXXXXX 100644
64
-static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
104
--- a/tcg/sparc64/tcg-target.h
65
+static void * const qemu_st_helpers[MO_SIZE + 1] = {
105
+++ b/tcg/sparc64/tcg-target.h
66
[MO_UB] = helper_ret_stb_mmu,
106
@@ -XXX,XX +XXX,XX @@
67
- [MO_LEUW] = helper_le_stw_mmu,
107
#define SPARC_TCG_TARGET_H
68
- [MO_LEUL] = helper_le_stl_mmu,
108
69
- [MO_LEUQ] = helper_le_stq_mmu,
109
#define TCG_TARGET_INSN_UNIT_SIZE 4
70
- [MO_BEUW] = helper_be_stw_mmu,
110
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
71
- [MO_BEUL] = helper_be_stl_mmu,
111
#define TCG_TARGET_NB_REGS 32
72
- [MO_BEUQ] = helper_be_stq_mmu,
112
#define MAX_CODE_GEN_BUFFER_SIZE (2 * GiB)
73
+#if HOST_BIG_ENDIAN
113
74
+ [MO_UW] = helper_be_stw_mmu,
114
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
75
+ [MO_UL] = helper_be_stl_mmu,
115
index XXXXXXX..XXXXXXX 100644
76
+ [MO_UQ] = helper_be_stq_mmu,
116
--- a/tcg/tci/tcg-target.h
77
+#else
117
+++ b/tcg/tci/tcg-target.h
78
+ [MO_UW] = helper_le_stw_mmu,
118
@@ -XXX,XX +XXX,XX @@
79
+ [MO_UL] = helper_le_stl_mmu,
119
80
+ [MO_UQ] = helper_le_stq_mmu,
120
#define TCG_TARGET_INTERPRETER 1
81
+#endif
121
#define TCG_TARGET_INSN_UNIT_SIZE 4
82
};
122
-#define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
83
123
#define MAX_CODE_GEN_BUFFER_SIZE ((size_t)-1)
84
/* We have four temps, we might as well expose three of them. */
124
85
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
125
#if UINTPTR_MAX == UINT32_MAX
86
87
tcg_out_ld_helper_args(s, l, &ldst_helper_param);
88
89
- tcg_out_call_int(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)], false);
90
+ tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SSIZE], false);
91
/* delay slot */
92
tcg_out_nop(s);
93
94
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
95
96
tcg_out_st_helper_args(s, l, &ldst_helper_param);
97
98
- tcg_out_call_int(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)], false);
99
+ tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE], false);
100
/* delay slot */
101
tcg_out_nop(s);
102
103
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
104
static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg lo, TCGReg hi,
105
TCGReg base, MemOp opc, TCGType type)
106
{
107
- switch (opc & (MO_SSIZE | MO_BSWAP)) {
108
+ switch (opc & MO_SSIZE) {
109
case MO_UB:
110
tcg_out_opc_imm(s, OPC_LBU, lo, base, 0);
111
break;
112
case MO_SB:
113
tcg_out_opc_imm(s, OPC_LB, lo, base, 0);
114
break;
115
- case MO_UW | MO_BSWAP:
116
- tcg_out_opc_imm(s, OPC_LHU, TCG_TMP1, base, 0);
117
- tcg_out_bswap16(s, lo, TCG_TMP1, TCG_BSWAP_IZ | TCG_BSWAP_OZ);
118
- break;
119
case MO_UW:
120
tcg_out_opc_imm(s, OPC_LHU, lo, base, 0);
121
break;
122
- case MO_SW | MO_BSWAP:
123
- tcg_out_opc_imm(s, OPC_LHU, TCG_TMP1, base, 0);
124
- tcg_out_bswap16(s, lo, TCG_TMP1, TCG_BSWAP_IZ | TCG_BSWAP_OS);
125
- break;
126
case MO_SW:
127
tcg_out_opc_imm(s, OPC_LH, lo, base, 0);
128
break;
129
- case MO_UL | MO_BSWAP:
130
- if (TCG_TARGET_REG_BITS == 64 && type == TCG_TYPE_I64) {
131
- if (use_mips32r2_instructions) {
132
- tcg_out_opc_imm(s, OPC_LWU, lo, base, 0);
133
- tcg_out_bswap32(s, lo, lo, TCG_BSWAP_IZ | TCG_BSWAP_OZ);
134
- } else {
135
- tcg_out_bswap_subr(s, bswap32u_addr);
136
- /* delay slot */
137
- tcg_out_opc_imm(s, OPC_LWU, TCG_TMP0, base, 0);
138
- tcg_out_mov(s, TCG_TYPE_I64, lo, TCG_TMP3);
139
- }
140
- break;
141
- }
142
- /* FALLTHRU */
143
- case MO_SL | MO_BSWAP:
144
- if (use_mips32r2_instructions) {
145
- tcg_out_opc_imm(s, OPC_LW, lo, base, 0);
146
- tcg_out_bswap32(s, lo, lo, 0);
147
- } else {
148
- tcg_out_bswap_subr(s, bswap32_addr);
149
- /* delay slot */
150
- tcg_out_opc_imm(s, OPC_LW, TCG_TMP0, base, 0);
151
- tcg_out_mov(s, TCG_TYPE_I32, lo, TCG_TMP3);
152
- }
153
- break;
154
case MO_UL:
155
if (TCG_TARGET_REG_BITS == 64 && type == TCG_TYPE_I64) {
156
tcg_out_opc_imm(s, OPC_LWU, lo, base, 0);
157
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg lo, TCGReg hi,
158
case MO_SL:
159
tcg_out_opc_imm(s, OPC_LW, lo, base, 0);
160
break;
161
- case MO_UQ | MO_BSWAP:
162
- if (TCG_TARGET_REG_BITS == 64) {
163
- if (use_mips32r2_instructions) {
164
- tcg_out_opc_imm(s, OPC_LD, lo, base, 0);
165
- tcg_out_bswap64(s, lo, lo);
166
- } else {
167
- tcg_out_bswap_subr(s, bswap64_addr);
168
- /* delay slot */
169
- tcg_out_opc_imm(s, OPC_LD, TCG_TMP0, base, 0);
170
- tcg_out_mov(s, TCG_TYPE_I64, lo, TCG_TMP3);
171
- }
172
- } else if (use_mips32r2_instructions) {
173
- tcg_out_opc_imm(s, OPC_LW, TCG_TMP0, base, 0);
174
- tcg_out_opc_imm(s, OPC_LW, TCG_TMP1, base, 4);
175
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP0, 0, TCG_TMP0);
176
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP1, 0, TCG_TMP1);
177
- tcg_out_opc_sa(s, OPC_ROTR, MIPS_BE ? lo : hi, TCG_TMP0, 16);
178
- tcg_out_opc_sa(s, OPC_ROTR, MIPS_BE ? hi : lo, TCG_TMP1, 16);
179
- } else {
180
- tcg_out_bswap_subr(s, bswap32_addr);
181
- /* delay slot */
182
- tcg_out_opc_imm(s, OPC_LW, TCG_TMP0, base, 0);
183
- tcg_out_opc_imm(s, OPC_LW, TCG_TMP0, base, 4);
184
- tcg_out_bswap_subr(s, bswap32_addr);
185
- /* delay slot */
186
- tcg_out_mov(s, TCG_TYPE_I32, MIPS_BE ? lo : hi, TCG_TMP3);
187
- tcg_out_mov(s, TCG_TYPE_I32, MIPS_BE ? hi : lo, TCG_TMP3);
188
- }
189
- break;
190
case MO_UQ:
191
/* Prefer to load from offset 0 first, but allow for overlap. */
192
if (TCG_TARGET_REG_BITS == 64) {
193
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
194
const MIPSInsn lw2 = MIPS_BE ? OPC_LWR : OPC_LWL;
195
const MIPSInsn ld1 = MIPS_BE ? OPC_LDL : OPC_LDR;
196
const MIPSInsn ld2 = MIPS_BE ? OPC_LDR : OPC_LDL;
197
+ bool sgn = opc & MO_SIGN;
198
199
- bool sgn = (opc & MO_SIGN);
200
-
201
- switch (opc & (MO_SSIZE | MO_BSWAP)) {
202
- case MO_SW | MO_BE:
203
- case MO_UW | MO_BE:
204
- tcg_out_opc_imm(s, sgn ? OPC_LB : OPC_LBU, TCG_TMP0, base, 0);
205
- tcg_out_opc_imm(s, OPC_LBU, lo, base, 1);
206
- if (use_mips32r2_instructions) {
207
- tcg_out_opc_bf(s, OPC_INS, lo, TCG_TMP0, 31, 8);
208
- } else {
209
- tcg_out_opc_sa(s, OPC_SLL, TCG_TMP0, TCG_TMP0, 8);
210
- tcg_out_opc_reg(s, OPC_OR, lo, TCG_TMP0, TCG_TMP1);
211
- }
212
- break;
213
-
214
- case MO_SW | MO_LE:
215
- case MO_UW | MO_LE:
216
- if (use_mips32r2_instructions && lo != base) {
217
+ switch (opc & MO_SIZE) {
218
+ case MO_16:
219
+ if (HOST_BIG_ENDIAN) {
220
+ tcg_out_opc_imm(s, sgn ? OPC_LB : OPC_LBU, TCG_TMP0, base, 0);
221
+ tcg_out_opc_imm(s, OPC_LBU, lo, base, 1);
222
+ if (use_mips32r2_instructions) {
223
+ tcg_out_opc_bf(s, OPC_INS, lo, TCG_TMP0, 31, 8);
224
+ } else {
225
+ tcg_out_opc_sa(s, OPC_SLL, TCG_TMP0, TCG_TMP0, 8);
226
+ tcg_out_opc_reg(s, OPC_OR, lo, lo, TCG_TMP0);
227
+ }
228
+ } else if (use_mips32r2_instructions && lo != base) {
229
tcg_out_opc_imm(s, OPC_LBU, lo, base, 0);
230
tcg_out_opc_imm(s, sgn ? OPC_LB : OPC_LBU, TCG_TMP0, base, 1);
231
tcg_out_opc_bf(s, OPC_INS, lo, TCG_TMP0, 31, 8);
232
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
233
}
234
break;
235
236
- case MO_SL:
237
- case MO_UL:
238
+ case MO_32:
239
tcg_out_opc_imm(s, lw1, lo, base, 0);
240
tcg_out_opc_imm(s, lw2, lo, base, 3);
241
if (TCG_TARGET_REG_BITS == 64 && type == TCG_TYPE_I64 && !sgn) {
242
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
243
}
244
break;
245
246
- case MO_UL | MO_BSWAP:
247
- case MO_SL | MO_BSWAP:
248
- if (use_mips32r2_instructions) {
249
- tcg_out_opc_imm(s, lw1, lo, base, 0);
250
- tcg_out_opc_imm(s, lw2, lo, base, 3);
251
- tcg_out_bswap32(s, lo, lo,
252
- TCG_TARGET_REG_BITS == 64 && type == TCG_TYPE_I64
253
- ? (sgn ? TCG_BSWAP_OS : TCG_BSWAP_OZ) : 0);
254
- } else {
255
- const tcg_insn_unit *subr =
256
- (TCG_TARGET_REG_BITS == 64 && type == TCG_TYPE_I64 && !sgn
257
- ? bswap32u_addr : bswap32_addr);
258
-
259
- tcg_out_opc_imm(s, lw1, TCG_TMP0, base, 0);
260
- tcg_out_bswap_subr(s, subr);
261
- /* delay slot */
262
- tcg_out_opc_imm(s, lw2, TCG_TMP0, base, 3);
263
- tcg_out_mov(s, type, lo, TCG_TMP3);
264
- }
265
- break;
266
-
267
- case MO_UQ:
268
+ case MO_64:
269
if (TCG_TARGET_REG_BITS == 64) {
270
tcg_out_opc_imm(s, ld1, lo, base, 0);
271
tcg_out_opc_imm(s, ld2, lo, base, 7);
272
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
273
}
274
break;
275
276
- case MO_UQ | MO_BSWAP:
277
- if (TCG_TARGET_REG_BITS == 64) {
278
- if (use_mips32r2_instructions) {
279
- tcg_out_opc_imm(s, ld1, lo, base, 0);
280
- tcg_out_opc_imm(s, ld2, lo, base, 7);
281
- tcg_out_bswap64(s, lo, lo);
282
- } else {
283
- tcg_out_opc_imm(s, ld1, TCG_TMP0, base, 0);
284
- tcg_out_bswap_subr(s, bswap64_addr);
285
- /* delay slot */
286
- tcg_out_opc_imm(s, ld2, TCG_TMP0, base, 7);
287
- tcg_out_mov(s, TCG_TYPE_I64, lo, TCG_TMP3);
288
- }
289
- } else if (use_mips32r2_instructions) {
290
- tcg_out_opc_imm(s, lw1, TCG_TMP0, base, 0 + 0);
291
- tcg_out_opc_imm(s, lw2, TCG_TMP0, base, 0 + 3);
292
- tcg_out_opc_imm(s, lw1, TCG_TMP1, base, 4 + 0);
293
- tcg_out_opc_imm(s, lw2, TCG_TMP1, base, 4 + 3);
294
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP0, 0, TCG_TMP0);
295
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP1, 0, TCG_TMP1);
296
- tcg_out_opc_sa(s, OPC_ROTR, MIPS_BE ? lo : hi, TCG_TMP0, 16);
297
- tcg_out_opc_sa(s, OPC_ROTR, MIPS_BE ? hi : lo, TCG_TMP1, 16);
298
- } else {
299
- tcg_out_opc_imm(s, lw1, TCG_TMP0, base, 0 + 0);
300
- tcg_out_bswap_subr(s, bswap32_addr);
301
- /* delay slot */
302
- tcg_out_opc_imm(s, lw2, TCG_TMP0, base, 0 + 3);
303
- tcg_out_opc_imm(s, lw1, TCG_TMP0, base, 4 + 0);
304
- tcg_out_mov(s, TCG_TYPE_I32, MIPS_BE ? lo : hi, TCG_TMP3);
305
- tcg_out_bswap_subr(s, bswap32_addr);
306
- /* delay slot */
307
- tcg_out_opc_imm(s, lw2, TCG_TMP0, base, 4 + 3);
308
- tcg_out_mov(s, TCG_TYPE_I32, MIPS_BE ? hi : lo, TCG_TMP3);
309
- }
310
- break;
311
-
312
default:
313
g_assert_not_reached();
314
}
315
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
316
static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg lo, TCGReg hi,
317
TCGReg base, MemOp opc)
318
{
319
- /* Don't clutter the code below with checks to avoid bswapping ZERO. */
320
- if ((lo | hi) == 0) {
321
- opc &= ~MO_BSWAP;
322
- }
323
-
324
- switch (opc & (MO_SIZE | MO_BSWAP)) {
325
+ switch (opc & MO_SIZE) {
326
case MO_8:
327
tcg_out_opc_imm(s, OPC_SB, lo, base, 0);
328
break;
329
-
330
- case MO_16 | MO_BSWAP:
331
- tcg_out_bswap16(s, TCG_TMP1, lo, 0);
332
- lo = TCG_TMP1;
333
- /* FALLTHRU */
334
case MO_16:
335
tcg_out_opc_imm(s, OPC_SH, lo, base, 0);
336
break;
337
-
338
- case MO_32 | MO_BSWAP:
339
- tcg_out_bswap32(s, TCG_TMP3, lo, 0);
340
- lo = TCG_TMP3;
341
- /* FALLTHRU */
342
case MO_32:
343
tcg_out_opc_imm(s, OPC_SW, lo, base, 0);
344
break;
345
-
346
- case MO_64 | MO_BSWAP:
347
- if (TCG_TARGET_REG_BITS == 64) {
348
- tcg_out_bswap64(s, TCG_TMP3, lo);
349
- tcg_out_opc_imm(s, OPC_SD, TCG_TMP3, base, 0);
350
- } else if (use_mips32r2_instructions) {
351
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP0, 0, MIPS_BE ? lo : hi);
352
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP1, 0, MIPS_BE ? hi : lo);
353
- tcg_out_opc_sa(s, OPC_ROTR, TCG_TMP0, TCG_TMP0, 16);
354
- tcg_out_opc_sa(s, OPC_ROTR, TCG_TMP1, TCG_TMP1, 16);
355
- tcg_out_opc_imm(s, OPC_SW, TCG_TMP0, base, 0);
356
- tcg_out_opc_imm(s, OPC_SW, TCG_TMP1, base, 4);
357
- } else {
358
- tcg_out_bswap32(s, TCG_TMP3, MIPS_BE ? lo : hi, 0);
359
- tcg_out_opc_imm(s, OPC_SW, TCG_TMP3, base, 0);
360
- tcg_out_bswap32(s, TCG_TMP3, MIPS_BE ? hi : lo, 0);
361
- tcg_out_opc_imm(s, OPC_SW, TCG_TMP3, base, 4);
362
- }
363
- break;
364
case MO_64:
365
if (TCG_TARGET_REG_BITS == 64) {
366
tcg_out_opc_imm(s, OPC_SD, lo, base, 0);
367
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg lo, TCGReg hi,
368
tcg_out_opc_imm(s, OPC_SW, MIPS_BE ? lo : hi, base, 4);
369
}
370
break;
371
-
372
default:
373
g_assert_not_reached();
374
}
375
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
376
const MIPSInsn sd1 = MIPS_BE ? OPC_SDL : OPC_SDR;
377
const MIPSInsn sd2 = MIPS_BE ? OPC_SDR : OPC_SDL;
378
379
- /* Don't clutter the code below with checks to avoid bswapping ZERO. */
380
- if ((lo | hi) == 0) {
381
- opc &= ~MO_BSWAP;
382
- }
383
-
384
- switch (opc & (MO_SIZE | MO_BSWAP)) {
385
- case MO_16 | MO_BE:
386
+ switch (opc & MO_SIZE) {
387
+ case MO_16:
388
tcg_out_opc_sa(s, OPC_SRL, TCG_TMP0, lo, 8);
389
- tcg_out_opc_imm(s, OPC_SB, TCG_TMP0, base, 0);
390
- tcg_out_opc_imm(s, OPC_SB, lo, base, 1);
391
+ tcg_out_opc_imm(s, OPC_SB, HOST_BIG_ENDIAN ? TCG_TMP0 : lo, base, 0);
392
+ tcg_out_opc_imm(s, OPC_SB, HOST_BIG_ENDIAN ? lo : TCG_TMP0, base, 1);
393
break;
394
395
- case MO_16 | MO_LE:
396
- tcg_out_opc_sa(s, OPC_SRL, TCG_TMP0, lo, 8);
397
- tcg_out_opc_imm(s, OPC_SB, lo, base, 0);
398
- tcg_out_opc_imm(s, OPC_SB, TCG_TMP0, base, 1);
399
- break;
400
-
401
- case MO_32 | MO_BSWAP:
402
- tcg_out_bswap32(s, TCG_TMP3, lo, 0);
403
- lo = TCG_TMP3;
404
- /* fall through */
405
case MO_32:
406
tcg_out_opc_imm(s, sw1, lo, base, 0);
407
tcg_out_opc_imm(s, sw2, lo, base, 3);
408
break;
409
410
- case MO_64 | MO_BSWAP:
411
- if (TCG_TARGET_REG_BITS == 64) {
412
- tcg_out_bswap64(s, TCG_TMP3, lo);
413
- lo = TCG_TMP3;
414
- } else if (use_mips32r2_instructions) {
415
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP0, 0, MIPS_BE ? hi : lo);
416
- tcg_out_opc_reg(s, OPC_WSBH, TCG_TMP1, 0, MIPS_BE ? lo : hi);
417
- tcg_out_opc_sa(s, OPC_ROTR, TCG_TMP0, TCG_TMP0, 16);
418
- tcg_out_opc_sa(s, OPC_ROTR, TCG_TMP1, TCG_TMP1, 16);
419
- hi = MIPS_BE ? TCG_TMP0 : TCG_TMP1;
420
- lo = MIPS_BE ? TCG_TMP1 : TCG_TMP0;
421
- } else {
422
- tcg_out_bswap32(s, TCG_TMP3, MIPS_BE ? lo : hi, 0);
423
- tcg_out_opc_imm(s, sw1, TCG_TMP3, base, 0 + 0);
424
- tcg_out_opc_imm(s, sw2, TCG_TMP3, base, 0 + 3);
425
- tcg_out_bswap32(s, TCG_TMP3, MIPS_BE ? hi : lo, 0);
426
- tcg_out_opc_imm(s, sw1, TCG_TMP3, base, 4 + 0);
427
- tcg_out_opc_imm(s, sw2, TCG_TMP3, base, 4 + 3);
428
- break;
429
- }
430
- /* fall through */
431
case MO_64:
432
if (TCG_TARGET_REG_BITS == 64) {
433
tcg_out_opc_imm(s, sd1, lo, base, 0);
434
--
126
--
435
2.34.1
127
2.34.1
436
128
437
129
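As an aside on the MO_BSWAP removal above, a rough and non-authoritative Python model of the new slow-path helper selection may help: the lookup table is indexed only by the size/sign part of the memory operation (opc & MO_SSIZE in the diff), and the big- vs little-endian helper entry points are chosen once from the host byte order instead of per-operation from MO_BSWAP. The string keys and the print at the end are purely illustrative; only the helper names are taken from the diff.

    import sys

    # Host byte order picks between the helper_be_* and helper_le_* entry
    # points once; the table itself is indexed only by size and signedness.
    HOST_BIG_ENDIAN = sys.byteorder == "big"
    E = "be" if HOST_BIG_ENDIAN else "le"

    qemu_ld_helpers = {
        "MO_UB": "helper_ret_ldub_mmu",
        "MO_SB": "helper_ret_ldsb_mmu",
        "MO_UW": f"helper_{E}_lduw_mmu",
        "MO_SW": f"helper_{E}_ldsw_mmu",
        "MO_UL": f"helper_{E}_ldul_mmu",
        "MO_SL": f"helper_{E}_ldsl_mmu",
        "MO_UQ": f"helper_{E}_ldq_mmu",
    }

    def slow_path_ld_helper(ssize: str) -> str:
        # Before the change the key also carried MO_BSWAP; any byte swap
        # is no longer the backend's problem.
        return qemu_ld_helpers[ssize]

    print(slow_path_ld_helper("MO_UL"))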
1
In gen_ldx/gen_stx, the only two locations for memory operations,
1
Invert the exit code, for use with the testsuite.
2
mark the operation as either aligned (softmmu) or unaligned
3
(user-only, as if emulated by the kernel).
4
2
5
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
4
---
8
configs/targets/nios2-softmmu.mak | 1 -
5
scripts/decodetree.py | 9 +++++++--
9
target/nios2/translate.c | 10 ++++++++++
6
1 file changed, 7 insertions(+), 2 deletions(-)
10
2 files changed, 10 insertions(+), 1 deletion(-)
11
7
12
diff --git a/configs/targets/nios2-softmmu.mak b/configs/targets/nios2-softmmu.mak
8
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
13
index XXXXXXX..XXXXXXX 100644
9
index XXXXXXX..XXXXXXX 100644
14
--- a/configs/targets/nios2-softmmu.mak
10
--- a/scripts/decodetree.py
15
+++ b/configs/targets/nios2-softmmu.mak
11
+++ b/scripts/decodetree.py
16
@@ -XXX,XX +XXX,XX @@
12
@@ -XXX,XX +XXX,XX @@
17
TARGET_ARCH=nios2
13
formats = {}
18
-TARGET_ALIGNED_ONLY=y
14
allpatterns = []
19
TARGET_NEED_FDT=y
15
anyextern = False
20
diff --git a/target/nios2/translate.c b/target/nios2/translate.c
16
+testforerror = False
21
index XXXXXXX..XXXXXXX 100644
17
22
--- a/target/nios2/translate.c
18
translate_prefix = 'trans'
23
+++ b/target/nios2/translate.c
19
translate_scope = 'static '
24
@@ -XXX,XX +XXX,XX @@ static void gen_ldx(DisasContext *dc, uint32_t code, uint32_t flags)
20
@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
25
TCGv data = dest_gpr(dc, instr.b);
21
if output_file and output_fd:
26
22
output_fd.close()
27
tcg_gen_addi_tl(addr, load_gpr(dc, instr.a), instr.imm16.s);
23
os.remove(output_file)
28
+#ifdef CONFIG_USER_ONLY
24
- exit(1)
29
+ flags |= MO_UNALN;
25
+ exit(0 if testforerror else 1)
30
+#else
26
# end error_with_file
31
+ flags |= MO_ALIGN;
27
32
+#endif
28
33
tcg_gen_qemu_ld_tl(data, addr, dc->mem_idx, flags);
29
@@ -XXX,XX +XXX,XX @@ def main():
34
}
30
global bitop_width
35
31
global variablewidth
36
@@ -XXX,XX +XXX,XX @@ static void gen_stx(DisasContext *dc, uint32_t code, uint32_t flags)
32
global anyextern
37
33
+ global testforerror
38
TCGv addr = tcg_temp_new();
34
39
tcg_gen_addi_tl(addr, load_gpr(dc, instr.a), instr.imm16.s);
35
decode_scope = 'static '
40
+#ifdef CONFIG_USER_ONLY
36
41
+ flags |= MO_UNALN;
37
long_opts = ['decode=', 'translate=', 'output=', 'insnwidth=',
42
+#else
38
- 'static-decode=', 'varinsnwidth=']
43
+ flags |= MO_ALIGN;
39
+ 'static-decode=', 'varinsnwidth=', 'test-for-error']
44
+#endif
40
try:
45
tcg_gen_qemu_st_tl(val, addr, dc->mem_idx, flags);
41
(opts, args) = getopt.gnu_getopt(sys.argv[1:], 'o:vw:', long_opts)
46
}
42
except getopt.GetoptError as err:
43
@@ -XXX,XX +XXX,XX @@ def main():
44
bitop_width = 64
45
elif insnwidth != 32:
46
error(0, 'cannot handle insns of width', insnwidth)
47
+ elif o == '--test-for-error':
48
+ testforerror = True
49
else:
50
assert False, 'unhandled option'
51
52
@@ -XXX,XX +XXX,XX @@ def main():
53
54
if output_file:
55
output_fd.close()
56
+ exit(1 if testforerror else 0)
57
# end main
58
47
59
48
--
60
--
49
2.34.1
61
2.34.1
50
51
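As a side note on the decodetree.py change above ("Invert the exit code, for use with the testsuite"): with --test-for-error a diagnosed parse error becomes the expected, successful outcome, and finishing without an error becomes the failure. A minimal sketch of that inversion follows; run_decoder() is a hypothetical stand-in for the real parsing work, not part of the script.

    import sys

    test_for_error = "--test-for-error" in sys.argv[1:]

    def report_error(msg: str) -> None:
        # A diagnosed error counts as success when we are deliberately
        # testing the error paths, and as failure otherwise.
        print(f"error: {msg}", file=sys.stderr)
        sys.exit(0 if test_for_error else 1)

    def run_decoder() -> None:
        # Hypothetical placeholder for parsing a .decode file.
        pass

    def main() -> None:
        try:
            run_decoder()
        except ValueError as e:
            report_error(str(e))
        # Reaching the end without an error fails in error-test mode.
        sys.exit(1 if test_for_error else 0)

    if __name__ == "__main__":
        main()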
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
1
Fix two copy-paste errors in the code that walks the parse tree.
2
and tcg_out_st_helper_args. This allows our local
3
tcg_out_arg_* infrastructure to be removed.
4
2
5
We are no longer filling the call or return branch
6
delay slots, nor are we tail-calling for the store,
7
but this seems a small price to pay.
8
9
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
10
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
11
---
4
---
12
tcg/mips/tcg-target.c.inc | 154 ++++++--------------------------------
5
scripts/decodetree.py | 4 ++--
13
1 file changed, 22 insertions(+), 132 deletions(-)
6
1 file changed, 2 insertions(+), 2 deletions(-)
14
7
15
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
8
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
16
index XXXXXXX..XXXXXXX 100644
9
index XXXXXXX..XXXXXXX 100644
17
--- a/tcg/mips/tcg-target.c.inc
10
--- a/scripts/decodetree.py
18
+++ b/tcg/mips/tcg-target.c.inc
11
+++ b/scripts/decodetree.py
19
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
12
@@ -XXX,XX +XXX,XX @@ def build_tree(self):
20
[MO_BEUQ] = helper_be_stq_mmu,
13
21
};
14
def prop_format(self):
22
15
for p in self.pats:
23
-/* Helper routines for marshalling helper function arguments into
16
- p.build_tree()
24
- * the correct registers and stack.
17
+ p.prop_format()
25
- * I is where we want to put this argument, and is updated and returned
18
26
- * for the next call. ARG is the argument itself.
19
def prop_width(self):
27
- *
20
width = None
28
- * We provide routines for arguments which are: immediate, 32 bit
21
@@ -XXX,XX +XXX,XX @@ def __build_tree(pats, outerbits, outermask):
29
- * value in register, 16 and 8 bit values in register (which must be zero
22
return t
30
- * extended before use) and 64 bit value in a lo:hi register pair.
23
31
- */
24
def build_tree(self):
32
-
25
- super().prop_format()
33
-static int tcg_out_call_iarg_reg(TCGContext *s, int i, TCGReg arg)
26
+ super().build_tree()
34
-{
27
self.tree = self.__build_tree(self.pats, self.fixedbits,
35
- if (i < ARRAY_SIZE(tcg_target_call_iarg_regs)) {
28
self.fixedmask)
36
- tcg_out_mov(s, TCG_TYPE_REG, tcg_target_call_iarg_regs[i], arg);
37
- } else {
38
- /* For N32 and N64, the initial offset is different. But there
39
- we also have 8 argument register so we don't run out here. */
40
- tcg_debug_assert(TCG_TARGET_REG_BITS == 32);
41
- tcg_out_st(s, TCG_TYPE_REG, arg, TCG_REG_SP, 4 * i);
42
- }
43
- return i + 1;
44
-}
45
-
46
-static int tcg_out_call_iarg_reg8(TCGContext *s, int i, TCGReg arg)
47
-{
48
- TCGReg tmp = TCG_TMP0;
49
- if (i < ARRAY_SIZE(tcg_target_call_iarg_regs)) {
50
- tmp = tcg_target_call_iarg_regs[i];
51
- }
52
- tcg_out_ext8u(s, tmp, arg);
53
- return tcg_out_call_iarg_reg(s, i, tmp);
54
-}
55
-
56
-static int tcg_out_call_iarg_reg16(TCGContext *s, int i, TCGReg arg)
57
-{
58
- TCGReg tmp = TCG_TMP0;
59
- if (i < ARRAY_SIZE(tcg_target_call_iarg_regs)) {
60
- tmp = tcg_target_call_iarg_regs[i];
61
- }
62
- tcg_out_opc_imm(s, OPC_ANDI, tmp, arg, 0xffff);
63
- return tcg_out_call_iarg_reg(s, i, tmp);
64
-}
65
-
66
-static int tcg_out_call_iarg_imm(TCGContext *s, int i, TCGArg arg)
67
-{
68
- TCGReg tmp = TCG_TMP0;
69
- if (arg == 0) {
70
- tmp = TCG_REG_ZERO;
71
- } else {
72
- if (i < ARRAY_SIZE(tcg_target_call_iarg_regs)) {
73
- tmp = tcg_target_call_iarg_regs[i];
74
- }
75
- tcg_out_movi(s, TCG_TYPE_REG, tmp, arg);
76
- }
77
- return tcg_out_call_iarg_reg(s, i, tmp);
78
-}
79
-
80
-static int tcg_out_call_iarg_reg2(TCGContext *s, int i, TCGReg al, TCGReg ah)
81
-{
82
- tcg_debug_assert(TCG_TARGET_REG_BITS == 32);
83
- i = (i + 1) & ~1;
84
- i = tcg_out_call_iarg_reg(s, i, (MIPS_BE ? ah : al));
85
- i = tcg_out_call_iarg_reg(s, i, (MIPS_BE ? al : ah));
86
- return i;
87
-}
88
+/* We have four temps, we might as well expose three of them. */
89
+static const TCGLdstHelperParam ldst_helper_param = {
90
+ .ntmp = 3, .tmp = { TCG_TMP0, TCG_TMP1, TCG_TMP2 }
91
+};
92
93
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
94
{
95
const tcg_insn_unit *tgt_rx = tcg_splitwx_to_rx(s->code_ptr);
96
- MemOpIdx oi = l->oi;
97
- MemOp opc = get_memop(oi);
98
- TCGReg v0;
99
- int i;
100
+ MemOp opc = get_memop(l->oi);
101
102
/* resolve label address */
103
if (!reloc_pc16(l->label_ptr[0], tgt_rx)
104
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
105
return false;
106
}
107
108
- i = 1;
109
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
110
- i = tcg_out_call_iarg_reg2(s, i, l->addrlo_reg, l->addrhi_reg);
111
- } else {
112
- i = tcg_out_call_iarg_reg(s, i, l->addrlo_reg);
113
- }
114
- i = tcg_out_call_iarg_imm(s, i, oi);
115
- i = tcg_out_call_iarg_imm(s, i, (intptr_t)l->raddr);
116
+ tcg_out_ld_helper_args(s, l, &ldst_helper_param);
117
+
118
tcg_out_call_int(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)], false);
119
/* delay slot */
120
- tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
121
+ tcg_out_nop(s);
122
123
- v0 = l->datalo_reg;
124
- if (TCG_TARGET_REG_BITS == 32 && (opc & MO_SIZE) == MO_64) {
125
- /* We eliminated V0 from the possible output registers, so it
126
- cannot be clobbered here. So we must move V1 first. */
127
- if (MIPS_BE) {
128
- tcg_out_mov(s, TCG_TYPE_I32, v0, TCG_REG_V1);
129
- v0 = l->datahi_reg;
130
- } else {
131
- tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_V1);
132
- }
133
- }
134
+ tcg_out_ld_helper_ret(s, l, true, &ldst_helper_param);
135
136
tcg_out_opc_br(s, OPC_BEQ, TCG_REG_ZERO, TCG_REG_ZERO);
137
if (!reloc_pc16(s->code_ptr - 1, l->raddr)) {
138
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
139
}
140
141
/* delay slot */
142
- if (TCG_TARGET_REG_BITS == 64 && l->type == TCG_TYPE_I32) {
143
- /* we always sign-extend 32-bit loads */
144
- tcg_out_ext32s(s, v0, TCG_REG_V0);
145
- } else {
146
- tcg_out_opc_reg(s, OPC_OR, v0, TCG_REG_V0, TCG_REG_ZERO);
147
- }
148
+ tcg_out_nop(s);
149
return true;
150
}
151
152
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
153
{
154
const tcg_insn_unit *tgt_rx = tcg_splitwx_to_rx(s->code_ptr);
155
- MemOpIdx oi = l->oi;
156
- MemOp opc = get_memop(oi);
157
- MemOp s_bits = opc & MO_SIZE;
158
- int i;
159
+ MemOp opc = get_memop(l->oi);
160
161
/* resolve label address */
162
if (!reloc_pc16(l->label_ptr[0], tgt_rx)
163
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
164
return false;
165
}
166
167
- i = 1;
168
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
169
- i = tcg_out_call_iarg_reg2(s, i, l->addrlo_reg, l->addrhi_reg);
170
- } else {
171
- i = tcg_out_call_iarg_reg(s, i, l->addrlo_reg);
172
- }
173
- switch (s_bits) {
174
- case MO_8:
175
- i = tcg_out_call_iarg_reg8(s, i, l->datalo_reg);
176
- break;
177
- case MO_16:
178
- i = tcg_out_call_iarg_reg16(s, i, l->datalo_reg);
179
- break;
180
- case MO_32:
181
- i = tcg_out_call_iarg_reg(s, i, l->datalo_reg);
182
- break;
183
- case MO_64:
184
- if (TCG_TARGET_REG_BITS == 32) {
185
- i = tcg_out_call_iarg_reg2(s, i, l->datalo_reg, l->datahi_reg);
186
- } else {
187
- i = tcg_out_call_iarg_reg(s, i, l->datalo_reg);
188
- }
189
- break;
190
- default:
191
- g_assert_not_reached();
192
- }
193
- i = tcg_out_call_iarg_imm(s, i, oi);
194
+ tcg_out_st_helper_args(s, l, &ldst_helper_param);
195
196
- /* Tail call to the store helper. Thus force the return address
197
- computation to take place in the return address register. */
198
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RA, (intptr_t)l->raddr);
199
- i = tcg_out_call_iarg_reg(s, i, TCG_REG_RA);
200
- tcg_out_call_int(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)], true);
201
+ tcg_out_call_int(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)], false);
202
/* delay slot */
203
- tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
204
+ tcg_out_nop(s);
205
+
206
+ tcg_out_opc_br(s, OPC_BEQ, TCG_REG_ZERO, TCG_REG_ZERO);
207
+ if (!reloc_pc16(s->code_ptr - 1, l->raddr)) {
208
+ return false;
209
+ }
210
+
211
+ /* delay slot */
212
+ tcg_out_nop(s);
213
return true;
214
}
215
29
216
--
30
--
217
2.34.1
31
2.34.1
218
219
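For the decodetree recursion fix above ("two copy-paste errors"), the essence is that each tree-walking method must recurse into the method of the same name, not into its sibling. The sketch below is a simplified, hypothetical model of that shape; the class and method names follow the diff, but the class layout and bodies are placeholders rather than the real script.

    class Pattern:
        def build_tree(self):
            pass        # leaf: nothing to build

        def prop_format(self):
            pass        # leaf: nothing to propagate

    class MultiPattern(Pattern):
        def __init__(self, pats):
            self.pats = pats

        def build_tree(self):
            for p in self.pats:
                p.build_tree()      # recurse with the same method ...

        def prop_format(self):
            for p in self.pats:
                p.prop_format()     # ... not with its sibling (the old bug)

    class IncMultiPattern(MultiPattern):
        def build_tree(self):
            super().build_tree()    # was super().prop_format() before the fix

    group = IncMultiPattern([Pattern(), MultiPattern([Pattern()])])
    group.build_tree()
    group.prop_format()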
1
Memory operations that are not already aligned, or otherwise
1
Test err_pattern_group_empty.decode failed with exception:
2
marked up, require addition of ctx->default_tcg_memop_mask.
2
3
Traceback (most recent call last):
4
File "./scripts/decodetree.py", line 1424, in <module> main()
5
File "./scripts/decodetree.py", line 1342, in main toppat.build_tree()
6
File "./scripts/decodetree.py", line 627, in build_tree
7
self.tree = self.__build_tree(self.pats, self.fixedbits,
8
File "./scripts/decodetree.py", line 607, in __build_tree
9
fb = i.fixedbits & innermask
10
TypeError: unsupported operand type(s) for &: 'NoneType' and 'int'
3
11
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
12
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
5
---
13
---
6
target/mips/tcg/mxu_translate.c | 3 ++-
14
scripts/decodetree.py | 6 ++++++
7
target/mips/tcg/micromips_translate.c.inc | 24 ++++++++++++++--------
15
1 file changed, 6 insertions(+)
8
target/mips/tcg/mips16e_translate.c.inc | 18 ++++++++++------
9
target/mips/tcg/nanomips_translate.c.inc | 25 +++++++++++------------
10
4 files changed, 42 insertions(+), 28 deletions(-)
11
16
12
diff --git a/target/mips/tcg/mxu_translate.c b/target/mips/tcg/mxu_translate.c
17
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
13
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
14
--- a/target/mips/tcg/mxu_translate.c
19
--- a/scripts/decodetree.py
15
+++ b/target/mips/tcg/mxu_translate.c
20
+++ b/scripts/decodetree.py
16
@@ -XXX,XX +XXX,XX @@ static void gen_mxu_s32ldd_s32lddr(DisasContext *ctx)
21
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
17
tcg_gen_ori_tl(t1, t1, 0xFFFFF000);
22
output(ind, '}\n')
18
}
23
else:
19
tcg_gen_add_tl(t1, t0, t1);
24
p.output_code(i, extracted, p.fixedbits, p.fixedmask)
20
- tcg_gen_qemu_ld_tl(t1, t1, ctx->mem_idx, MO_TESL ^ (sel * MO_BSWAP));
25
+
21
+ tcg_gen_qemu_ld_tl(t1, t1, ctx->mem_idx, (MO_TESL ^ (sel * MO_BSWAP)) |
26
+ def build_tree(self):
22
+ ctx->default_tcg_memop_mask);
27
+ if not self.pats:
23
28
+ error_with_file(self.file, self.lineno, 'empty pattern group')
24
gen_store_mxu_gpr(t1, XRa);
29
+ super().build_tree()
25
}
30
+
26
diff --git a/target/mips/tcg/micromips_translate.c.inc b/target/mips/tcg/micromips_translate.c.inc
31
#end IncMultiPattern
27
index XXXXXXX..XXXXXXX 100644
32
28
--- a/target/mips/tcg/micromips_translate.c.inc
33
29
+++ b/target/mips/tcg/micromips_translate.c.inc
30
@@ -XXX,XX +XXX,XX @@ static void gen_ldst_pair(DisasContext *ctx, uint32_t opc, int rd,
31
gen_reserved_instruction(ctx);
32
return;
33
}
34
- tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL);
35
+ tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL |
36
+ ctx->default_tcg_memop_mask);
37
gen_store_gpr(t1, rd);
38
tcg_gen_movi_tl(t1, 4);
39
gen_op_addr_add(ctx, t0, t0, t1);
40
- tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL);
41
+ tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL |
42
+ ctx->default_tcg_memop_mask);
43
gen_store_gpr(t1, rd + 1);
44
break;
45
case SWP:
46
gen_load_gpr(t1, rd);
47
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
48
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
49
+ ctx->default_tcg_memop_mask);
50
tcg_gen_movi_tl(t1, 4);
51
gen_op_addr_add(ctx, t0, t0, t1);
52
gen_load_gpr(t1, rd + 1);
53
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
54
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
55
+ ctx->default_tcg_memop_mask);
56
break;
57
#ifdef TARGET_MIPS64
58
case LDP:
59
@@ -XXX,XX +XXX,XX @@ static void gen_ldst_pair(DisasContext *ctx, uint32_t opc, int rd,
60
gen_reserved_instruction(ctx);
61
return;
62
}
63
- tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TEUQ);
64
+ tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TEUQ |
65
+ ctx->default_tcg_memop_mask);
66
gen_store_gpr(t1, rd);
67
tcg_gen_movi_tl(t1, 8);
68
gen_op_addr_add(ctx, t0, t0, t1);
69
- tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TEUQ);
70
+ tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TEUQ |
71
+ ctx->default_tcg_memop_mask);
72
gen_store_gpr(t1, rd + 1);
73
break;
74
case SDP:
75
gen_load_gpr(t1, rd);
76
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUQ);
77
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUQ |
78
+ ctx->default_tcg_memop_mask);
79
tcg_gen_movi_tl(t1, 8);
80
gen_op_addr_add(ctx, t0, t0, t1);
81
gen_load_gpr(t1, rd + 1);
82
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUQ);
83
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUQ |
84
+ ctx->default_tcg_memop_mask);
85
break;
86
#endif
87
}
88
diff --git a/target/mips/tcg/mips16e_translate.c.inc b/target/mips/tcg/mips16e_translate.c.inc
89
index XXXXXXX..XXXXXXX 100644
90
--- a/target/mips/tcg/mips16e_translate.c.inc
91
+++ b/target/mips/tcg/mips16e_translate.c.inc
92
@@ -XXX,XX +XXX,XX @@ static void gen_mips16_save(DisasContext *ctx,
93
case 4:
94
gen_base_offset_addr(ctx, t0, 29, 12);
95
gen_load_gpr(t1, 7);
96
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
97
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
98
+ ctx->default_tcg_memop_mask);
99
/* Fall through */
100
case 3:
101
gen_base_offset_addr(ctx, t0, 29, 8);
102
gen_load_gpr(t1, 6);
103
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
104
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
105
+ ctx->default_tcg_memop_mask);
106
/* Fall through */
107
case 2:
108
gen_base_offset_addr(ctx, t0, 29, 4);
109
gen_load_gpr(t1, 5);
110
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
111
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
112
+ ctx->default_tcg_memop_mask);
113
/* Fall through */
114
case 1:
115
gen_base_offset_addr(ctx, t0, 29, 0);
116
gen_load_gpr(t1, 4);
117
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
118
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL |
119
+ ctx->default_tcg_memop_mask);
120
}
121
122
gen_load_gpr(t0, 29);
123
@@ -XXX,XX +XXX,XX @@ static void gen_mips16_save(DisasContext *ctx,
124
tcg_gen_movi_tl(t2, -4); \
125
gen_op_addr_add(ctx, t0, t0, t2); \
126
gen_load_gpr(t1, reg); \
127
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL); \
128
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL | \
129
+ ctx->default_tcg_memop_mask); \
130
} while (0)
131
132
if (do_ra) {
133
@@ -XXX,XX +XXX,XX @@ static void gen_mips16_restore(DisasContext *ctx,
134
#define DECR_AND_LOAD(reg) do { \
135
tcg_gen_movi_tl(t2, -4); \
136
gen_op_addr_add(ctx, t0, t0, t2); \
137
- tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL); \
138
+ tcg_gen_qemu_ld_tl(t1, t0, ctx->mem_idx, MO_TESL | \
139
+ ctx->default_tcg_memop_mask); \
140
gen_store_gpr(t1, reg); \
141
} while (0)
142
143
diff --git a/target/mips/tcg/nanomips_translate.c.inc b/target/mips/tcg/nanomips_translate.c.inc
144
index XXXXXXX..XXXXXXX 100644
145
--- a/target/mips/tcg/nanomips_translate.c.inc
146
+++ b/target/mips/tcg/nanomips_translate.c.inc
147
@@ -XXX,XX +XXX,XX @@ static void gen_p_lsx(DisasContext *ctx, int rd, int rs, int rt)
148
149
switch (extract32(ctx->opcode, 7, 4)) {
150
case NM_LBX:
151
- tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx,
152
- MO_SB);
153
+ tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx, MO_SB);
154
gen_store_gpr(t0, rd);
155
break;
156
case NM_LHX:
157
/*case NM_LHXS:*/
158
tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx,
159
- MO_TESW);
160
+ MO_TESW | ctx->default_tcg_memop_mask);
161
gen_store_gpr(t0, rd);
162
break;
163
case NM_LWX:
164
/*case NM_LWXS:*/
165
tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx,
166
- MO_TESL);
167
+ MO_TESL | ctx->default_tcg_memop_mask);
168
gen_store_gpr(t0, rd);
169
break;
170
case NM_LBUX:
171
- tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx,
172
- MO_UB);
173
+ tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx, MO_UB);
174
gen_store_gpr(t0, rd);
175
break;
176
case NM_LHUX:
177
/*case NM_LHUXS:*/
178
tcg_gen_qemu_ld_tl(t0, t0, ctx->mem_idx,
179
- MO_TEUW);
180
+ MO_TEUW | ctx->default_tcg_memop_mask);
181
gen_store_gpr(t0, rd);
182
break;
183
case NM_SBX:
184
check_nms(ctx);
185
gen_load_gpr(t1, rd);
186
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx,
187
- MO_8);
188
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_8);
189
break;
190
case NM_SHX:
191
/*case NM_SHXS:*/
192
check_nms(ctx);
193
gen_load_gpr(t1, rd);
194
tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx,
195
- MO_TEUW);
196
+ MO_TEUW | ctx->default_tcg_memop_mask);
197
break;
198
case NM_SWX:
199
/*case NM_SWXS:*/
200
check_nms(ctx);
201
gen_load_gpr(t1, rd);
202
tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx,
203
- MO_TEUL);
204
+ MO_TEUL | ctx->default_tcg_memop_mask);
205
break;
206
case NM_LWC1X:
207
/*case NM_LWC1XS:*/
208
@@ -XXX,XX +XXX,XX @@ static int decode_nanomips_32_48_opc(CPUMIPSState *env, DisasContext *ctx)
209
addr_off);
210
211
tcg_gen_movi_tl(t0, addr);
212
- tcg_gen_qemu_ld_tl(cpu_gpr[rt], t0, ctx->mem_idx, MO_TESL);
213
+ tcg_gen_qemu_ld_tl(cpu_gpr[rt], t0, ctx->mem_idx,
214
+ MO_TESL | ctx->default_tcg_memop_mask);
215
}
216
break;
217
case NM_SWPC48:
218
@@ -XXX,XX +XXX,XX @@ static int decode_nanomips_32_48_opc(CPUMIPSState *env, DisasContext *ctx)
219
tcg_gen_movi_tl(t0, addr);
220
gen_load_gpr(t1, rt);
221
222
- tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx, MO_TEUL);
223
+ tcg_gen_qemu_st_tl(t1, t0, ctx->mem_idx,
224
+ MO_TEUL | ctx->default_tcg_memop_mask);
225
}
226
break;
227
default:
228
--
34
--
229
2.34.1
35
2.34.1
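The empty-pattern-group change above replaces a confusing TypeError deep inside __build_tree with an explicit diagnostic at the point where the problem can first be detected. A rough sketch of the idea follows; the exception class and attribute names here are invented for illustration and do not match the script's real error_with_file() machinery.

    class PatternGroupError(Exception):
        pass

    class MultiPattern:
        def __init__(self, pats, file="x.decode", lineno=1):
            self.pats = pats
            self.file = file
            self.lineno = lineno
            self.fixedbits = None   # only computed once there are patterns

        def build_tree(self):
            if not self.pats:
                # Fail early with a readable message instead of letting a
                # later "None & int" expression raise TypeError.
                raise PatternGroupError(
                    f"{self.file}:{self.lineno}: empty pattern group")
            # ... real tree building would combine self.pats here ...

    try:
        MultiPattern([]).build_tree()
    except PatternGroupError as e:
        print(e)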
1
These are atomic operations, so mark as requiring alignment.
1
Also, do not report any PermissionError on the remove.
2
The primary purpose is testing with -o /dev/null.
2
3
3
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
---
5
---
5
target/mips/tcg/nanomips_translate.c.inc | 5 +++--
6
scripts/decodetree.py | 7 ++++++-
6
1 file changed, 3 insertions(+), 2 deletions(-)
7
1 file changed, 6 insertions(+), 1 deletion(-)
7
8
8
diff --git a/target/mips/tcg/nanomips_translate.c.inc b/target/mips/tcg/nanomips_translate.c.inc
9
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
9
index XXXXXXX..XXXXXXX 100644
10
index XXXXXXX..XXXXXXX 100644
10
--- a/target/mips/tcg/nanomips_translate.c.inc
11
--- a/scripts/decodetree.py
11
+++ b/target/mips/tcg/nanomips_translate.c.inc
12
+++ b/scripts/decodetree.py
12
@@ -XXX,XX +XXX,XX @@ static void gen_llwp(DisasContext *ctx, uint32_t base, int16_t offset,
13
@@ -XXX,XX +XXX,XX @@ def error_with_file(file, lineno, *args):
13
TCGv tmp2 = tcg_temp_new();
14
14
15
if output_file and output_fd:
15
gen_base_offset_addr(ctx, taddr, base, offset);
16
output_fd.close()
16
- tcg_gen_qemu_ld_i64(tval, taddr, ctx->mem_idx, MO_TEUQ);
17
- os.remove(output_file)
17
+ tcg_gen_qemu_ld_i64(tval, taddr, ctx->mem_idx, MO_TEUQ | MO_ALIGN);
18
+ # Do not try to remove e.g. -o /dev/null
18
if (cpu_is_bigendian(ctx)) {
19
+ if not output_file.startswith("/dev"):
19
tcg_gen_extr_i64_tl(tmp2, tmp1, tval);
20
+ try:
20
} else {
21
+ os.remove(output_file)
21
@@ -XXX,XX +XXX,XX @@ static void gen_scwp(DisasContext *ctx, uint32_t base, int16_t offset,
22
+ except PermissionError:
22
23
+ pass
23
tcg_gen_ld_i64(llval, cpu_env, offsetof(CPUMIPSState, llval_wp));
24
exit(0 if testforerror else 1)
24
tcg_gen_atomic_cmpxchg_i64(val, taddr, llval, tval,
25
# end error_with_file
25
- eva ? MIPS_HFLAG_UM : ctx->mem_idx, MO_64);
26
26
+ eva ? MIPS_HFLAG_UM : ctx->mem_idx,
27
+ MO_64 | MO_ALIGN);
28
if (reg1 != 0) {
29
tcg_gen_movi_tl(cpu_gpr[reg1], 1);
30
}
31
--
27
--
32
2.34.1
28
2.34.1
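The cleanup-path tweak to decodetree.py above boils down to: never try to unlink a /dev path, and swallow a PermissionError if the unlink is refused, so that the decode error rather than the cleanup failure is what gets reported. A small stand-alone version of that guard, with an invented helper name:

    import os

    def remove_partial_output(output_file: str) -> None:
        # Do not try to remove e.g. -o /dev/null.
        if output_file.startswith("/dev"):
            return
        try:
            os.remove(output_file)
        except PermissionError:
            # Best-effort cleanup only; the interesting error is the one
            # already being reported about the .decode input.
            pass

    remove_partial_output("/dev/null")   # no-op by design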
1
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
1
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
---
2
---
3
configs/targets/mips-linux-user.mak | 1 -
3
tests/decode/check.sh | 24 ----------------
4
configs/targets/mips-softmmu.mak | 1 -
4
tests/decode/meson.build | 59 ++++++++++++++++++++++++++++++++++++++++
5
configs/targets/mips64-linux-user.mak | 1 -
5
tests/meson.build | 5 +---
6
configs/targets/mips64-softmmu.mak | 1 -
6
3 files changed, 60 insertions(+), 28 deletions(-)
7
configs/targets/mips64el-linux-user.mak | 1 -
7
delete mode 100755 tests/decode/check.sh
8
configs/targets/mips64el-softmmu.mak | 1 -
8
create mode 100644 tests/decode/meson.build
9
configs/targets/mipsel-linux-user.mak | 1 -
10
configs/targets/mipsel-softmmu.mak | 1 -
11
configs/targets/mipsn32-linux-user.mak | 1 -
12
configs/targets/mipsn32el-linux-user.mak | 1 -
13
10 files changed, 10 deletions(-)
14
9
15
diff --git a/configs/targets/mips-linux-user.mak b/configs/targets/mips-linux-user.mak
10
diff --git a/tests/decode/check.sh b/tests/decode/check.sh
11
deleted file mode 100755
12
index XXXXXXX..XXXXXXX
13
--- a/tests/decode/check.sh
14
+++ /dev/null
15
@@ -XXX,XX +XXX,XX @@
16
-#!/bin/sh
17
-# This work is licensed under the terms of the GNU LGPL, version 2 or later.
18
-# See the COPYING.LIB file in the top-level directory.
19
-
20
-PYTHON=$1
21
-DECODETREE=$2
22
-E=0
23
-
24
-# All of these tests should produce errors
25
-for i in err_*.decode; do
26
- if $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
27
- # Pass, aka failed to fail.
28
- echo FAIL: $i 1>&2
29
- E=1
30
- fi
31
-done
32
-
33
-for i in succ_*.decode; do
34
- if ! $PYTHON $DECODETREE $i > /dev/null 2> /dev/null; then
35
- echo FAIL:$i 1>&2
36
- fi
37
-done
38
-
39
-exit $E
40
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
41
new file mode 100644
42
index XXXXXXX..XXXXXXX
43
--- /dev/null
44
+++ b/tests/decode/meson.build
45
@@ -XXX,XX +XXX,XX @@
46
+err_tests = [
47
+ 'err_argset1.decode',
48
+ 'err_argset2.decode',
49
+ 'err_field1.decode',
50
+ 'err_field2.decode',
51
+ 'err_field3.decode',
52
+ 'err_field4.decode',
53
+ 'err_field5.decode',
54
+ 'err_field6.decode',
55
+ 'err_init1.decode',
56
+ 'err_init2.decode',
57
+ 'err_init3.decode',
58
+ 'err_init4.decode',
59
+ 'err_overlap1.decode',
60
+ 'err_overlap2.decode',
61
+ 'err_overlap3.decode',
62
+ 'err_overlap4.decode',
63
+ 'err_overlap5.decode',
64
+ 'err_overlap6.decode',
65
+ 'err_overlap7.decode',
66
+ 'err_overlap8.decode',
67
+ 'err_overlap9.decode',
68
+ 'err_pattern_group_empty.decode',
69
+ 'err_pattern_group_ident1.decode',
70
+ 'err_pattern_group_ident2.decode',
71
+ 'err_pattern_group_nest1.decode',
72
+ 'err_pattern_group_nest2.decode',
73
+ 'err_pattern_group_nest3.decode',
74
+ 'err_pattern_group_overlap1.decode',
75
+ 'err_width1.decode',
76
+ 'err_width2.decode',
77
+ 'err_width3.decode',
78
+ 'err_width4.decode',
79
+]
80
+
81
+succ_tests = [
82
+ 'succ_argset_type1.decode',
83
+ 'succ_function.decode',
84
+ 'succ_ident1.decode',
85
+ 'succ_pattern_group_nest1.decode',
86
+ 'succ_pattern_group_nest2.decode',
87
+ 'succ_pattern_group_nest3.decode',
88
+ 'succ_pattern_group_nest4.decode',
89
+]
90
+
91
+suite = 'decodetree'
92
+decodetree = find_program(meson.project_source_root() / 'scripts/decodetree.py')
93
+
94
+foreach t: err_tests
95
+ test(fs.replace_suffix(t, ''),
96
+ decodetree, args: ['-o', '/dev/null', '--test-for-error', files(t)],
97
+ suite: suite)
98
+endforeach
99
+
100
+foreach t: succ_tests
101
+ test(fs.replace_suffix(t, ''),
102
+ decodetree, args: ['-o', '/dev/null', files(t)],
103
+ suite: suite)
104
+endforeach
105
diff --git a/tests/meson.build b/tests/meson.build
16
index XXXXXXX..XXXXXXX 100644
106
index XXXXXXX..XXXXXXX 100644
17
--- a/configs/targets/mips-linux-user.mak
107
--- a/tests/meson.build
18
+++ b/configs/targets/mips-linux-user.mak
108
+++ b/tests/meson.build
19
@@ -XXX,XX +XXX,XX @@ TARGET_ARCH=mips
109
@@ -XXX,XX +XXX,XX @@ if have_tools and have_vhost_user and 'CONFIG_LINUX' in config_host
20
TARGET_ABI_MIPSO32=y
110
dependencies: [qemuutil, vhost_user])
21
TARGET_SYSTBL_ABI=o32
111
endif
22
TARGET_SYSTBL=syscall_o32.tbl
112
23
-TARGET_ALIGNED_ONLY=y
113
-test('decodetree', sh,
24
TARGET_BIG_ENDIAN=y
114
- args: [ files('decode/check.sh'), config_host['PYTHON'], files('../scripts/decodetree.py') ],
25
diff --git a/configs/targets/mips-softmmu.mak b/configs/targets/mips-softmmu.mak
115
- workdir: meson.current_source_dir() / 'decode',
26
index XXXXXXX..XXXXXXX 100644
116
- suite: 'decodetree')
27
--- a/configs/targets/mips-softmmu.mak
117
+subdir('decode')
28
+++ b/configs/targets/mips-softmmu.mak
118
29
@@ -XXX,XX +XXX,XX @@
119
if 'CONFIG_TCG' in config_all
30
TARGET_ARCH=mips
120
subdir('fp')
31
-TARGET_ALIGNED_ONLY=y
32
TARGET_BIG_ENDIAN=y
33
TARGET_SUPPORTS_MTTCG=y
34
diff --git a/configs/targets/mips64-linux-user.mak b/configs/targets/mips64-linux-user.mak
35
index XXXXXXX..XXXXXXX 100644
36
--- a/configs/targets/mips64-linux-user.mak
37
+++ b/configs/targets/mips64-linux-user.mak
38
@@ -XXX,XX +XXX,XX @@ TARGET_ABI_MIPSN64=y
39
TARGET_BASE_ARCH=mips
40
TARGET_SYSTBL_ABI=n64
41
TARGET_SYSTBL=syscall_n64.tbl
42
-TARGET_ALIGNED_ONLY=y
43
TARGET_BIG_ENDIAN=y
44
diff --git a/configs/targets/mips64-softmmu.mak b/configs/targets/mips64-softmmu.mak
45
index XXXXXXX..XXXXXXX 100644
46
--- a/configs/targets/mips64-softmmu.mak
47
+++ b/configs/targets/mips64-softmmu.mak
48
@@ -XXX,XX +XXX,XX @@
49
TARGET_ARCH=mips64
50
TARGET_BASE_ARCH=mips
51
-TARGET_ALIGNED_ONLY=y
52
TARGET_BIG_ENDIAN=y
53
diff --git a/configs/targets/mips64el-linux-user.mak b/configs/targets/mips64el-linux-user.mak
54
index XXXXXXX..XXXXXXX 100644
55
--- a/configs/targets/mips64el-linux-user.mak
56
+++ b/configs/targets/mips64el-linux-user.mak
57
@@ -XXX,XX +XXX,XX @@ TARGET_ABI_MIPSN64=y
58
TARGET_BASE_ARCH=mips
59
TARGET_SYSTBL_ABI=n64
60
TARGET_SYSTBL=syscall_n64.tbl
61
-TARGET_ALIGNED_ONLY=y
62
diff --git a/configs/targets/mips64el-softmmu.mak b/configs/targets/mips64el-softmmu.mak
63
index XXXXXXX..XXXXXXX 100644
64
--- a/configs/targets/mips64el-softmmu.mak
65
+++ b/configs/targets/mips64el-softmmu.mak
66
@@ -XXX,XX +XXX,XX @@
67
TARGET_ARCH=mips64
68
TARGET_BASE_ARCH=mips
69
-TARGET_ALIGNED_ONLY=y
70
TARGET_NEED_FDT=y
71
diff --git a/configs/targets/mipsel-linux-user.mak b/configs/targets/mipsel-linux-user.mak
72
index XXXXXXX..XXXXXXX 100644
73
--- a/configs/targets/mipsel-linux-user.mak
74
+++ b/configs/targets/mipsel-linux-user.mak
75
@@ -XXX,XX +XXX,XX @@ TARGET_ARCH=mips
76
TARGET_ABI_MIPSO32=y
77
TARGET_SYSTBL_ABI=o32
78
TARGET_SYSTBL=syscall_o32.tbl
79
-TARGET_ALIGNED_ONLY=y
80
diff --git a/configs/targets/mipsel-softmmu.mak b/configs/targets/mipsel-softmmu.mak
81
index XXXXXXX..XXXXXXX 100644
82
--- a/configs/targets/mipsel-softmmu.mak
83
+++ b/configs/targets/mipsel-softmmu.mak
84
@@ -XXX,XX +XXX,XX @@
85
TARGET_ARCH=mips
86
-TARGET_ALIGNED_ONLY=y
87
TARGET_SUPPORTS_MTTCG=y
88
diff --git a/configs/targets/mipsn32-linux-user.mak b/configs/targets/mipsn32-linux-user.mak
89
index XXXXXXX..XXXXXXX 100644
90
--- a/configs/targets/mipsn32-linux-user.mak
91
+++ b/configs/targets/mipsn32-linux-user.mak
92
@@ -XXX,XX +XXX,XX @@ TARGET_ABI32=y
93
TARGET_BASE_ARCH=mips
94
TARGET_SYSTBL_ABI=n32
95
TARGET_SYSTBL=syscall_n32.tbl
96
-TARGET_ALIGNED_ONLY=y
97
TARGET_BIG_ENDIAN=y
98
diff --git a/configs/targets/mipsn32el-linux-user.mak b/configs/targets/mipsn32el-linux-user.mak
99
index XXXXXXX..XXXXXXX 100644
100
--- a/configs/targets/mipsn32el-linux-user.mak
101
+++ b/configs/targets/mipsn32el-linux-user.mak
102
@@ -XXX,XX +XXX,XX @@ TARGET_ABI32=y
103
TARGET_BASE_ARCH=mips
104
TARGET_SYSTBL_ABI=n32
105
TARGET_SYSTBL=syscall_n32.tbl
106
-TARGET_ALIGNED_ONLY=y
107
--
121
--
108
2.34.1
122
2.34.1
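For readers wondering what the meson conversion above actually executes, each registered test is one decodetree invocation: err_* inputs are run with --test-for-error and must be rejected, succ_* inputs must be accepted. A rough Python equivalent of two such checks, assuming it is run from a QEMU source tree:

    import subprocess
    import sys

    DECODETREE = "scripts/decodetree.py"

    def run(*args: str) -> int:
        return subprocess.run([sys.executable, DECODETREE, *args]).returncode

    # err_* inputs must be diagnosed; --test-for-error inverts the exit
    # code so that "decodetree reported an error" is test success.
    assert run("-o", "/dev/null", "--test-for-error",
               "tests/decode/err_pattern_group_empty.decode") == 0

    # succ_* inputs must be accepted as-is.
    assert run("-o", "/dev/null", "tests/decode/succ_ident1.decode") == 0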
1
From: Thomas Huth <thuth@redhat.com>
1
From: Peter Maydell <peter.maydell@linaro.org>
2
2
3
By using target_words_bigendian() instead of an ifdef,
3
Document the named field syntax that we want to implement for the
4
we can build this code once.
4
decodetree script. This allows a field to be defined in terms of
5
some other field that the instruction pattern has already set, for
6
example:
5
7
6
Signed-off-by: Thomas Huth <thuth@redhat.com>
8
%sz_imm 10:3 sz:3 !function=expand_sz_imm
7
Message-Id: <20230508133745.109463-3-thuth@redhat.com>
9
8
[rth: Type change done in a separate patch]
10
to allow a function to be passed both an immediate field from the
9
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
11
instruction and also a sz value which might have been specified by
12
the instruction pattern directly (sz=1, etc) rather than being a
13
simple field within the instruction.
14
15
Note that the restriction on not having the format referring to the
16
pattern and the pattern referring to the format simultaneously is a
17
restriction of the decoder generator rather than inherently being a
18
silly thing to do.
19
20
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
21
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
22
Message-Id: <20230523120447.728365-3-peter.maydell@linaro.org>
10
---
23
---
11
disas/disas.c | 10 +++++-----
24
docs/devel/decodetree.rst | 33 ++++++++++++++++++++++++++++-----
12
disas/meson.build | 3 ++-
25
1 file changed, 28 insertions(+), 5 deletions(-)
13
2 files changed, 7 insertions(+), 6 deletions(-)
14
26
15
diff --git a/disas/disas.c b/disas/disas.c
27
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
16
index XXXXXXX..XXXXXXX 100644
28
index XXXXXXX..XXXXXXX 100644
17
--- a/disas/disas.c
29
--- a/docs/devel/decodetree.rst
18
+++ b/disas/disas.c
30
+++ b/docs/devel/decodetree.rst
19
@@ -XXX,XX +XXX,XX @@ void disas_initialize_debug_target(CPUDebug *s, CPUState *cpu)
31
@@ -XXX,XX +XXX,XX @@ Fields
20
s->cpu = cpu;
32
21
s->info.read_memory_func = target_read_memory;
33
Syntax::
22
s->info.print_address_func = print_address;
34
23
-#if TARGET_BIG_ENDIAN
35
- field_def := '%' identifier ( unnamed_field )* ( !function=identifier )?
24
- s->info.endian = BFD_ENDIAN_BIG;
36
+ field_def := '%' identifier ( field )* ( !function=identifier )?
25
-#else
37
+ field := unnamed_field | named_field
26
- s->info.endian = BFD_ENDIAN_LITTLE;
38
unnamed_field := number ':' ( 's' ) number
27
-#endif
39
+ named_field := identifier ':' ( 's' ) number
28
+ if (target_words_bigendian()) {
40
29
+ s->info.endian = BFD_ENDIAN_BIG;
41
For *unnamed_field*, the first number is the least-significant bit position
30
+ } else {
42
of the field and the second number is the length of the field. If the 's' is
31
+ s->info.endian = BFD_ENDIAN_LITTLE;
43
-present, the field is considered signed. If multiple ``unnamed_fields`` are
32
+ }
44
-present, they are concatenated. In this way one can define disjoint fields.
33
45
+present, the field is considered signed.
34
CPUClass *cc = CPU_GET_CLASS(cpu);
46
+
35
if (cc->disas_set_info) {
47
+A *named_field* refers to some other field in the instruction pattern
36
diff --git a/disas/meson.build b/disas/meson.build
48
+or format. Regardless of the length of the other field where it is
37
index XXXXXXX..XXXXXXX 100644
49
+defined, it will be inserted into this field with the specified
38
--- a/disas/meson.build
50
+signedness and bit width.
39
+++ b/disas/meson.build
51
+
40
@@ -XXX,XX +XXX,XX @@ common_ss.add(when: 'CONFIG_SH4_DIS', if_true: files('sh4.c'))
52
+Field definitions that involve loops (i.e. where a field is defined
41
common_ss.add(when: 'CONFIG_SPARC_DIS', if_true: files('sparc.c'))
53
+directly or indirectly in terms of itself) are errors.
42
common_ss.add(when: 'CONFIG_XTENSA_DIS', if_true: files('xtensa.c'))
54
+
43
common_ss.add(when: capstone, if_true: [files('capstone.c'), capstone])
55
+A format can include fields that refer to named fields that are
44
+common_ss.add(files('disas.c'))
56
+defined in the instruction pattern(s) that use the format.
45
57
+Conversely, an instruction pattern can include fields that refer to
46
softmmu_ss.add(files('disas-mon.c'))
58
+named fields that are defined in the format it uses. However you
47
-specific_ss.add(files('disas.c'), capstone)
59
+cannot currently do both at once (i.e. pattern P uses format F; F has
48
+specific_ss.add(capstone)
60
+a field A that refers to a named field B that is defined in P, and P
61
+has a field C that refers to a named field D that is defined in F).
62
+
63
+If multiple ``fields`` are present, they are concatenated.
64
+In this way one can define disjoint fields.
65
66
If ``!function`` is specified, the concatenated result is passed through the
67
named function, taking and returning an integral value.
68
69
-One may use ``!function`` with zero ``unnamed_fields``. This case is called
70
+One may use ``!function`` with zero ``fields``. This case is called
71
a *parameter*, and the named function is only passed the ``DisasContext``
72
and returns an integral value extracted from there.
73
74
-A field with no ``unnamed_fields`` and no ``!function`` is in error.
75
+A field with no ``fields`` and no ``!function`` is in error.
76
77
Field examples:
78
79
@@ -XXX,XX +XXX,XX @@ Field examples:
80
| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | |
81
| !function=expand_shimm8 | extract(i, 13, 1)) |
82
+---------------------------+---------------------------------------------+
83
+| %sz_imm 10:2 sz:3 | expand_sz_imm(extract(i, 10, 2) << 3 | |
84
+| !function=expand_sz_imm | extract(a->sz, 0, 3)) |
85
++---------------------------+---------------------------------------------+
86
87
Argument Sets
88
=============
49
--
89
--
50
2.34.1
90
2.34.1
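To make the %sz_imm example in the documentation above concrete, here is a small Python model of the expansion it describes: the unnamed 10:2 field comes from the instruction word, the named sz:3 field comes from the already-decoded argument, and the two are concatenated before being passed to the !function. Both extract() and expand_sz_imm() are placeholders standing in for the generated C helpers.

    def extract(value: int, pos: int, length: int) -> int:
        return (value >> pos) & ((1 << length) - 1)

    def expand_sz_imm(ctx, value: int) -> int:
        # Placeholder for the target-specific helper named by !function.
        return value

    # %sz_imm 10:2 sz:3 !function=expand_sz_imm
    def decode_sz_imm(ctx, insn: int, a_sz: int) -> int:
        return expand_sz_imm(ctx, (extract(insn, 10, 2) << 3)
                                  | extract(a_sz, 0, 3))

    print(bin(decode_sz_imm(None, 0b11_0000000000, a_sz=0b101)))  # 0b11101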
1
From: Jamie Iles <quic_jiles@quicinc.com>
1
From: Peter Maydell <peter.maydell@linaro.org>
2
2
3
The round-robin scheduler will iterate over the CPU list with an
3
To support referring to other named fields in field definitions, we
4
assigned budget until the next timer expiry and may exit early because
4
need to pass the str_extract() method a function which tells it how
5
of a TB exit. This is fine under normal operation but with icount
5
to emit the code for a previously initialized named field. (In
6
enabled and SMP it is possible for a CPU to be starved of run time and
6
Pattern::output_code() the other field will be "u.f_foo.field", and
7
the system live-locks.
7
in Format::output_extract() it is "a->field".)
8
8
9
For example, booting a riscv64 platform with '-icount
9
Refactor the two callsites that currently do "output code to
10
shift=0,align=off,sleep=on -smp 2' we observe a livelock once the kernel
10
initialize each field", and have them pass a lambda that defines how
11
has timers enabled and starts performing TLB shootdowns. In this case
11
to format the lvalue in each case. This is then used both in
12
we have CPU 0 in M-mode with interrupts disabled sending an IPI to CPU
12
emitting the LHS of the assignment and also passed down to
13
1. As we enter the TCG loop, we assign the icount budget up to the next timer
13
str_extract() as a new argument (unused at the moment, but will be
14
interrupt to CPU 0 and begin executing where the guest is sat in a busy
14
used in the following patch).
15
loop exhausting all of the budget before we try to execute CPU 1 which
16
is the target of the IPI but CPU 1 is left with no budget with which to
17
execute and the process repeats.
18
15
19
We try here to add some fairness by splitting the budget across all of
16
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
20
the CPUs on the thread fairly before entering each one. The CPU count
17
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
21
is cached on CPU list generation ID to avoid iterating the list on each
18
Message-Id: <20230523120447.728365-4-peter.maydell@linaro.org>
22
loop iteration. With this change it is possible to boot an SMP rv64
19
---
23
guest with icount enabled and no hangs.
20
scripts/decodetree.py | 26 +++++++++++++++-----------
21
1 file changed, 15 insertions(+), 11 deletions(-)
24
22
25
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
23
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
26
Tested-by: Peter Maydell <peter.maydell@linaro.org>
27
Signed-off-by: Jamie Iles <quic_jiles@quicinc.com>
28
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
29
Message-Id: <20230427020925.51003-3-quic_jiles@quicinc.com>
30
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
31
---
32
accel/tcg/tcg-accel-ops-icount.h | 3 ++-
33
accel/tcg/tcg-accel-ops-icount.c | 21 ++++++++++++++----
34
accel/tcg/tcg-accel-ops-rr.c | 37 +++++++++++++++++++++++++++++++-
35
replay/replay.c | 3 +--
36
4 files changed, 56 insertions(+), 8 deletions(-)
37
38
diff --git a/accel/tcg/tcg-accel-ops-icount.h b/accel/tcg/tcg-accel-ops-icount.h
39
index XXXXXXX..XXXXXXX 100644
24
index XXXXXXX..XXXXXXX 100644
40
--- a/accel/tcg/tcg-accel-ops-icount.h
25
--- a/scripts/decodetree.py
41
+++ b/accel/tcg/tcg-accel-ops-icount.h
26
+++ b/scripts/decodetree.py
42
@@ -XXX,XX +XXX,XX @@
27
@@ -XXX,XX +XXX,XX @@ def __str__(self):
43
#define TCG_ACCEL_OPS_ICOUNT_H
28
s = ''
44
29
return str(self.pos) + ':' + s + str(self.len)
45
void icount_handle_deadline(void);
30
46
-void icount_prepare_for_run(CPUState *cpu);
31
- def str_extract(self):
47
+void icount_prepare_for_run(CPUState *cpu, int64_t cpu_budget);
32
+ def str_extract(self, lvalue_formatter):
48
+int64_t icount_percpu_budget(int cpu_count);
33
global bitop_width
49
void icount_process_data(CPUState *cpu);
34
s = 's' if self.sign else ''
50
35
return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
51
void icount_handle_interrupt(CPUState *cpu, int mask);
36
@@ -XXX,XX +XXX,XX @@ def __init__(self, subs, mask):
52
diff --git a/accel/tcg/tcg-accel-ops-icount.c b/accel/tcg/tcg-accel-ops-icount.c
37
def __str__(self):
53
index XXXXXXX..XXXXXXX 100644
38
return str(self.subs)
54
--- a/accel/tcg/tcg-accel-ops-icount.c
39
55
+++ b/accel/tcg/tcg-accel-ops-icount.c
40
- def str_extract(self):
56
@@ -XXX,XX +XXX,XX @@ void icount_handle_deadline(void)
41
+ def str_extract(self, lvalue_formatter):
57
}
42
global bitop_width
58
}
43
ret = '0'
59
44
pos = 0
60
-void icount_prepare_for_run(CPUState *cpu)
45
for f in reversed(self.subs):
61
+/* Distribute the budget evenly across all CPUs */
46
- ext = f.str_extract()
62
+int64_t icount_percpu_budget(int cpu_count)
47
+ ext = f.str_extract(lvalue_formatter)
63
+{
48
if pos == 0:
64
+ int64_t limit = icount_get_limit();
49
ret = ext
65
+ int64_t timeslice = limit / cpu_count;
50
else:
51
@@ -XXX,XX +XXX,XX @@ def __init__(self, value):
52
def __str__(self):
53
return str(self.value)
54
55
- def str_extract(self):
56
+ def str_extract(self, lvalue_formatter):
57
return str(self.value)
58
59
def __cmp__(self, other):
60
@@ -XXX,XX +XXX,XX @@ def __init__(self, func, base):
61
def __str__(self):
62
return self.func + '(' + str(self.base) + ')'
63
64
- def str_extract(self):
65
- return self.func + '(ctx, ' + self.base.str_extract() + ')'
66
+ def str_extract(self, lvalue_formatter):
67
+ return (self.func + '(ctx, '
68
+ + self.base.str_extract(lvalue_formatter) + ')')
69
70
def __eq__(self, other):
71
return self.func == other.func and self.base == other.base
72
@@ -XXX,XX +XXX,XX @@ def __init__(self, func):
73
def __str__(self):
74
return self.func
75
76
- def str_extract(self):
77
+ def str_extract(self, lvalue_formatter):
78
return self.func + '(ctx)'
79
80
def __eq__(self, other):
81
@@ -XXX,XX +XXX,XX @@ def __str__(self):
82
83
def str1(self, i):
84
return str_indent(i) + self.__str__()
66
+
85
+
67
+ if (timeslice == 0) {
86
+ def output_fields(self, indent, lvalue_formatter):
68
+ timeslice = limit;
87
+ for n, f in self.fields.items():
69
+ }
88
+ output(indent, lvalue_formatter(n), ' = ',
70
+
89
+ f.str_extract(lvalue_formatter), ';\n')
71
+ return timeslice;
90
# end General
72
+}
91
73
+
92
74
+void icount_prepare_for_run(CPUState *cpu, int64_t cpu_budget)
93
@@ -XXX,XX +XXX,XX @@ def extract_name(self):
75
{
94
def output_extract(self):
76
int insns_left;
95
output('static void ', self.extract_name(), '(DisasContext *ctx, ',
77
96
self.base.struct_name(), ' *a, ', insntype, ' insn)\n{\n')
78
@@ -XXX,XX +XXX,XX @@ void icount_prepare_for_run(CPUState *cpu)
97
- for n, f in self.fields.items():
79
g_assert(cpu_neg(cpu)->icount_decr.u16.low == 0);
98
- output(' a->', n, ' = ', f.str_extract(), ';\n')
80
g_assert(cpu->icount_extra == 0);
99
+ self.output_fields(str_indent(4), lambda n: 'a->' + n)
81
100
output('}\n\n')
82
- cpu->icount_budget = icount_get_limit();
101
# end Format
83
+ replay_mutex_lock();
102
84
+
103
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
85
+ cpu->icount_budget = MIN(icount_get_limit(), cpu_budget);
104
if not extracted:
86
insns_left = MIN(0xffff, cpu->icount_budget);
105
output(ind, self.base.extract_name(),
87
cpu_neg(cpu)->icount_decr.u16.low = insns_left;
106
'(ctx, &u.f_', arg, ', insn);\n')
88
cpu->icount_extra = cpu->icount_budget - insns_left;
107
- for n, f in self.fields.items():
89
108
- output(ind, 'u.f_', arg, '.', n, ' = ', f.str_extract(), ';\n')
90
- replay_mutex_lock();
109
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
91
-
110
output(ind, 'if (', translate_prefix, '_', self.name,
92
if (cpu->icount_budget == 0) {
111
'(ctx, &u.f_', arg, ')) return true;\n')
93
/*
94
* We're called without the iothread lock, so must take it while
95
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
96
index XXXXXXX..XXXXXXX 100644
97
--- a/accel/tcg/tcg-accel-ops-rr.c
98
+++ b/accel/tcg/tcg-accel-ops-rr.c
99
@@ -XXX,XX +XXX,XX @@
100
*/
101
102
#include "qemu/osdep.h"
103
+#include "qemu/lockable.h"
104
#include "sysemu/tcg.h"
105
#include "sysemu/replay.h"
106
#include "sysemu/cpu-timers.h"
107
@@ -XXX,XX +XXX,XX @@ static void rr_force_rcu(Notifier *notify, void *data)
108
rr_kick_next_cpu();
109
}
110
111
+/*
112
+ * Calculate the number of CPUs that we will process in a single iteration of
113
+ * the main CPU thread loop so that we can fairly distribute the instruction
114
+ * count across CPUs.
115
+ *
116
+ * The CPU count is cached based on the CPU list generation ID to avoid
117
+ * iterating the list every time.
118
+ */
119
+static int rr_cpu_count(void)
120
+{
121
+ static unsigned int last_gen_id = ~0;
122
+ static int cpu_count;
123
+ CPUState *cpu;
124
+
125
+ QEMU_LOCK_GUARD(&qemu_cpu_list_lock);
126
+
127
+ if (cpu_list_generation_id_get() != last_gen_id) {
128
+ cpu_count = 0;
129
+ CPU_FOREACH(cpu) {
130
+ ++cpu_count;
131
+ }
132
+ last_gen_id = cpu_list_generation_id_get();
133
+ }
134
+
135
+ return cpu_count;
136
+}
137
+
138
/*
139
* In the single-threaded case each vCPU is simulated in turn. If
140
* there is more than a single vCPU we create a simple timer to kick
141
@@ -XXX,XX +XXX,XX @@ static void *rr_cpu_thread_fn(void *arg)
142
cpu->exit_request = 1;
143
144
while (1) {
145
+ /* Only used for icount_enabled() */
146
+ int64_t cpu_budget = 0;
147
+
148
qemu_mutex_unlock_iothread();
149
replay_mutex_lock();
150
qemu_mutex_lock_iothread();
151
152
if (icount_enabled()) {
153
+ int cpu_count = rr_cpu_count();
154
+
155
/* Account partial waits to QEMU_CLOCK_VIRTUAL. */
156
icount_account_warp_timer();
157
/*
158
@@ -XXX,XX +XXX,XX @@ static void *rr_cpu_thread_fn(void *arg)
159
* waking up the I/O thread and waiting for completion.
160
*/
161
icount_handle_deadline();
162
+
163
+ cpu_budget = icount_percpu_budget(cpu_count);
164
}
165
166
replay_mutex_unlock();
167
@@ -XXX,XX +XXX,XX @@ static void *rr_cpu_thread_fn(void *arg)
168
169
qemu_mutex_unlock_iothread();
170
if (icount_enabled()) {
171
- icount_prepare_for_run(cpu);
172
+ icount_prepare_for_run(cpu, cpu_budget);
173
}
174
r = tcg_cpus_exec(cpu);
175
if (icount_enabled()) {
176
diff --git a/replay/replay.c b/replay/replay.c
177
index XXXXXXX..XXXXXXX 100644
178
--- a/replay/replay.c
179
+++ b/replay/replay.c
180
@@ -XXX,XX +XXX,XX @@ uint64_t replay_get_current_icount(void)
181
int replay_get_instructions(void)
182
{
183
int res = 0;
184
- replay_mutex_lock();
185
+ g_assert(replay_mutex_locked());
186
if (replay_next_event_is(EVENT_INSTRUCTION)) {
187
res = replay_state.instruction_count;
188
if (replay_break_icount != -1LL) {
189
@@ -XXX,XX +XXX,XX @@ int replay_get_instructions(void)
190
}
191
}
192
}
193
- replay_mutex_unlock();
194
return res;
195
}
196
112
197
--
113
--
198
2.34.1
114
2.34.1
199
200
diff view generated by jsdifflib
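[Editor's note: to make the budget split in icount_percpu_budget() above concrete, the same logic as a small Python sketch (illustrative only; the real code is the C shown in the diff):]

    def icount_percpu_budget(limit, cpu_count):
        timeslice = limit // cpu_count
        # if the deadline is closer than one instruction per CPU, fall back
        # to handing the whole budget to the current CPU rather than zero
        return timeslice if timeslice > 0 else limit

Each vCPU then runs with MIN(icount_get_limit(), cpu_budget), so one CPU can no longer exhaust the whole timeslice before its siblings get a turn.

[Editor's note: the str_extract() refactoring on the decodetree side can likewise be pictured with a much-simplified sketch; the class and names below are illustrative, not the script's real ones.]

    class BitsField:
        # plain field extracted straight from the instruction word
        def __init__(self, pos, length):
            self.pos, self.len = pos, length

        def str_extract(self, lvalue_formatter):
            # ordinary fields ignore the formatter; a NamedField (added in a
            # later patch) uses it to name an already-assigned field
            return f'extract32(insn, {self.pos}, {self.len})'

    def output_fields(fields, lvalue_formatter):
        return [f'{lvalue_formatter(n)} = {f.str_extract(lvalue_formatter)};'
                for n, f in fields.items()]

    fields = {'rd': BitsField(0, 5)}
    print(output_fields(fields, lambda n: 'a->' + n))       # Format::output_extract()
    print(output_fields(fields, lambda n: 'u.f_foo.' + n))  # Pattern::output_code()
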
1
Reviewed-by: Thomas Huth <thuth@redhat.com>
1
From: Peter Maydell <peter.maydell@linaro.org>
2
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2
3
Message-Id: <20230503072331.1747057-83-richard.henderson@linaro.org>
3
To support named fields, we will need to be able to do a topological
4
sort (so that we ensure that we output the assignment to field A
5
before the assignment to field B if field B refers to field A by
6
name). The good news is that there is a tsort in the python standard
7
library; the bad news is that it was only added in Python 3.9.
8
9
To bridge the gap between our current minimum supported Python
10
version and 3.9, provide a local implementation that has the
11
same API as the stdlib version for the parts we care about.
12
In future when QEMU's minimum Python version requirement reaches
13
3.9 we can delete this code and replace it with an 'import' line.
14
15
The core of this implementation is based on
16
https://code.activestate.com/recipes/578272-topological-sort/
17
which is MIT-licensed.
18
19
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
20
Acked-by: Richard Henderson <richard.henderson@linaro.org>
21
Message-Id: <20230523120447.728365-5-peter.maydell@linaro.org>
4
---
22
---
5
include/disas/disas.h | 6 ------
23
scripts/decodetree.py | 74 +++++++++++++++++++++++++++++++++++++++++++
6
disas/disas.c | 3 ++-
24
1 file changed, 74 insertions(+)
7
2 files changed, 2 insertions(+), 7 deletions(-)
8
25
9
diff --git a/include/disas/disas.h b/include/disas/disas.h
26
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
10
index XXXXXXX..XXXXXXX 100644
27
index XXXXXXX..XXXXXXX 100644
11
--- a/include/disas/disas.h
28
--- a/scripts/decodetree.py
12
+++ b/include/disas/disas.h
29
+++ b/scripts/decodetree.py
13
@@ -XXX,XX +XXX,XX @@
30
@@ -XXX,XX +XXX,XX @@
14
#ifndef QEMU_DISAS_H
31
re_fmt_ident = '@[a-zA-Z0-9_]*'
15
#define QEMU_DISAS_H
32
re_pat_ident = '[a-zA-Z0-9_]*'
16
33
17
-#include "exec/hwaddr.h"
34
+# Local implementation of a topological sort. We use the same API that
18
-
35
+# the Python graphlib does, so that when QEMU moves forward to a
19
-#ifdef NEED_CPU_H
36
+# baseline of Python 3.9 or newer this code can all be dropped and
20
-#include "cpu.h"
37
+# replaced with:
21
-
38
+# from graphlib import TopologicalSorter, CycleError
22
/* Disassemble this for me please... (debugging). */
39
+#
23
void disas(FILE *out, const void *code, size_t size);
40
+# https://docs.python.org/3.9/library/graphlib.html#graphlib.TopologicalSorter
24
void target_disas(FILE *out, CPUState *cpu, uint64_t code, size_t size);
41
+#
25
@@ -XXX,XX +XXX,XX @@ char *plugin_disas(CPUState *cpu, uint64_t addr, size_t size);
42
+# We only implement the parts of TopologicalSorter we care about:
26
43
+# ts = TopologicalSorter(graph=None)
27
/* Look up symbol for debugging purpose. Returns "" if unknown. */
44
+# create the sorter. graph is a dictionary whose keys are
28
const char *lookup_symbol(uint64_t orig_addr);
45
+# nodes and whose values are lists of the predecessors of that node.
29
-#endif
46
+# (That is, if graph contains "A" -> ["B", "C"] then we must output
30
47
+# B and C before A.)
31
struct syminfo;
48
+# ts.static_order()
32
struct elf32_sym;
49
+# returns a list of all the nodes in sorted order, or raises CycleError
33
diff --git a/disas/disas.c b/disas/disas.c
50
+# CycleError
34
index XXXXXXX..XXXXXXX 100644
51
+# exception raised if there are cycles in the graph. The second
35
--- a/disas/disas.c
52
+# element in the args attribute is a list of nodes which form a
36
+++ b/disas/disas.c
53
+# cycle; the first and last element are the same, eg [a, b, c, a]
37
@@ -XXX,XX +XXX,XX @@
54
+# (Our implementation doesn't give the order correctly.)
38
#include "disas/dis-asm.h"
55
+#
39
#include "elf.h"
56
+# For our purposes we can assume that the data set is always small
40
#include "qemu/qemu-print.h"
57
+# (typically 10 nodes or less, actual links in the graph very rare),
41
-
58
+# so we don't need to worry about efficiency of implementation.
42
#include "disas/disas.h"
59
+#
43
#include "disas/capstone.h"
60
+# The core of this implementation is from
44
+#include "hw/core/cpu.h"
61
+# https://code.activestate.com/recipes/578272-topological-sort/
45
+#include "exec/memory.h"
62
+# (but updated to Python 3), and is under the MIT license.
46
63
+
47
typedef struct CPUDebug {
64
+class CycleError(ValueError):
48
struct disassemble_info info;
65
+ """Subclass of ValueError raised if cycles exist in the graph"""
66
+ pass
67
+
68
+class TopologicalSorter:
69
+ """Topologically sort a graph"""
70
+ def __init__(self, graph=None):
71
+ self.graph = graph
72
+
73
+ def static_order(self):
74
+ # We do the sort right here, unlike the stdlib version
75
+ from functools import reduce
76
+ data = {}
77
+ r = []
78
+
79
+ if not self.graph:
80
+ return []
81
+
82
+ # This code wants the values in the dict to be specifically sets
83
+ for k, v in self.graph.items():
84
+ data[k] = set(v)
85
+
86
+ # Find all items that don't depend on anything.
87
+ extra_items_in_deps = (reduce(set.union, data.values())
88
+ - set(data.keys()))
89
+ # Add empty dependencies where needed
90
+ data.update({item:{} for item in extra_items_in_deps})
91
+ while True:
92
+ ordered = set(item for item, dep in data.items() if not dep)
93
+ if not ordered:
94
+ break
95
+ r.extend(ordered)
96
+ data = {item: (dep - ordered)
97
+ for item, dep in data.items()
98
+ if item not in ordered}
99
+ if data:
100
+ # This doesn't give as nice results as the stdlib, which
101
+ # gives you the cycle by listing the nodes in order. Here
102
+ # we only know the nodes in the cycle but not their order.
103
+ raise CycleError(f'nodes are in a cycle', list(data.keys()))
104
+
105
+ return r
106
+# end TopologicalSorter
107
+
108
def error_with_file(file, lineno, *args):
109
"""Print an error message from file:line and args and exit."""
110
global output_file
49
--
111
--
50
2.34.1
112
2.34.1
diff view generated by jsdifflib
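[Editor's note: a small usage sketch of the class added above, assuming TopologicalSorter and CycleError are in scope, either from scripts/decodetree.py or, on Python 3.9+, from graphlib.]

    # Hypothetical field-dependency graph: 'imm' references 'sz', so 'sz'
    # must be assigned first.  Values are lists of predecessors, as in graphlib.
    order = list(TopologicalSorter({'imm': ['sz'], 'sz': []}).static_order())
    assert order.index('sz') < order.index('imm')

    # A cyclic definition raises CycleError; args[1] names the offending nodes.
    try:
        list(TopologicalSorter({'a': ['b'], 'b': ['a']}).static_order())
    except CycleError as e:
        print('cycle through:', e.args[1])
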
1
Fix these before moving the file, for checkpatch.pl.
1
From: Peter Maydell <peter.maydell@linaro.org>
2
2
3
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
3
Implement support for named fields, i.e. where one field is defined
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
in terms of another, rather than directly in terms of bits extracted
5
Message-Id: <20230510170812.663149-1-richard.henderson@linaro.org>
5
from the instruction.
6
7
The new method referenced_fields() on all the Field classes returns a
8
list of fields that this field references. This just passes through,
9
except for the new NamedField class.
10
11
We can then use referenced_fields() to:
12
* construct a list of 'dangling references' for a format or
13
pattern, which is the fields that the format/pattern uses but
14
doesn't define itself
15
* do a topological sort, so that we output "field = value"
16
assignments in an order that means that we assign a field before
17
we reference it in a subsequent assignment
18
* check when we output the code for a pattern whether we need to
19
fill in the format fields before or after the pattern fields, and
20
do other error checking
21
22
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
23
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
24
Message-Id: <20230523120447.728365-6-peter.maydell@linaro.org>
6
---
25
---
7
disas.c | 11 ++++++-----
26
scripts/decodetree.py | 145 ++++++++++++++++++++++++++++++++++++++++--
8
1 file changed, 6 insertions(+), 5 deletions(-)
27
1 file changed, 139 insertions(+), 6 deletions(-)
9
28
10
diff --git a/disas.c b/disas.c
29
diff --git a/scripts/decodetree.py b/scripts/decodetree.py
11
index XXXXXXX..XXXXXXX 100644
30
index XXXXXXX..XXXXXXX 100644
12
--- a/disas.c
31
--- a/scripts/decodetree.py
13
+++ b/disas.c
32
+++ b/scripts/decodetree.py
14
@@ -XXX,XX +XXX,XX @@ void target_disas(FILE *out, CPUState *cpu, target_ulong code,
33
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
15
}
34
s = 's' if self.sign else ''
16
35
return f'{s}extract{bitop_width}(insn, {self.pos}, {self.len})'
17
for (pc = code; size > 0; pc += count, size -= count) {
36
18
-    fprintf(out, "0x" TARGET_FMT_lx ": ", pc);
37
+ def referenced_fields(self):
19
-    count = s.info.print_insn(pc, &s.info);
38
+ return []
20
-    fprintf(out, "\n");
39
+
21
-    if (count < 0)
40
def __eq__(self, other):
22
-     break;
41
return self.sign == other.sign and self.mask == other.mask
23
+ fprintf(out, "0x" TARGET_FMT_lx ": ", pc);
42
24
+ count = s.info.print_insn(pc, &s.info);
43
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
25
+ fprintf(out, "\n");
44
pos += f.len
26
+ if (count < 0) {
45
return ret
27
+ break;
46
28
+ }
47
+ def referenced_fields(self):
29
if (size < count) {
48
+ l = []
30
fprintf(out,
49
+ for f in self.subs:
31
"Disassembler disagrees with translator over instruction "
50
+ l.extend(f.referenced_fields())
51
+ return l
52
+
53
def __ne__(self, other):
54
if len(self.subs) != len(other.subs):
55
return True
56
@@ -XXX,XX +XXX,XX @@ def __str__(self):
57
def str_extract(self, lvalue_formatter):
58
return str(self.value)
59
60
+ def referenced_fields(self):
61
+ return []
62
+
63
def __cmp__(self, other):
64
return self.value - other.value
65
# end ConstField
66
@@ -XXX,XX +XXX,XX @@ def str_extract(self, lvalue_formatter):
67
return (self.func + '(ctx, '
68
+ self.base.str_extract(lvalue_formatter) + ')')
69
70
+ def referenced_fields(self):
71
+ return self.base.referenced_fields()
72
+
73
def __eq__(self, other):
74
return self.func == other.func and self.base == other.base
75
76
@@ -XXX,XX +XXX,XX @@ def __str__(self):
77
def str_extract(self, lvalue_formatter):
78
return self.func + '(ctx)'
79
80
+ def referenced_fields(self):
81
+ return []
82
+
83
def __eq__(self, other):
84
return self.func == other.func
85
86
@@ -XXX,XX +XXX,XX @@ def __ne__(self, other):
87
return not self.__eq__(other)
88
# end ParameterField
89
90
+class NamedField:
91
+ """Class representing a field already named in the pattern"""
92
+ def __init__(self, name, sign, len):
93
+ self.mask = 0
94
+ self.sign = sign
95
+ self.len = len
96
+ self.name = name
97
+
98
+ def __str__(self):
99
+ return self.name
100
+
101
+ def str_extract(self, lvalue_formatter):
102
+ global bitop_width
103
+ s = 's' if self.sign else ''
104
+ lvalue = lvalue_formatter(self.name)
105
+ return f'{s}extract{bitop_width}({lvalue}, 0, {self.len})'
106
+
107
+ def referenced_fields(self):
108
+ return [self.name]
109
+
110
+ def __eq__(self, other):
111
+ return self.name == other.name
112
+
113
+ def __ne__(self, other):
114
+ return not self.__eq__(other)
115
+# end NamedField
116
117
class Arguments:
118
"""Class representing the extracted fields of a format"""
119
@@ -XXX,XX +XXX,XX @@ def output_def(self):
120
output('} ', self.struct_name(), ';\n\n')
121
# end Arguments
122
123
-
124
class General:
125
"""Common code between instruction formats and instruction patterns"""
126
def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
127
@@ -XXX,XX +XXX,XX @@ def __init__(self, name, lineno, base, fixb, fixm, udfm, fldm, flds, w):
128
self.fieldmask = fldm
129
self.fields = flds
130
self.width = w
131
+ self.dangling = None
132
133
def __str__(self):
134
return self.name + ' ' + str_match_bits(self.fixedbits, self.fixedmask)
135
@@ -XXX,XX +XXX,XX @@ def __str__(self):
136
def str1(self, i):
137
return str_indent(i) + self.__str__()
138
139
+ def dangling_references(self):
140
+ # Return a list of all named references which aren't satisfied
141
+ # directly by this format/pattern. This will be either:
142
+ # * a format referring to a field which is specified by the
143
+ # pattern(s) using it
144
+ # * a pattern referring to a field which is specified by the
145
+ # format it uses
146
+ # * a user error (referring to a field that doesn't exist at all)
147
+ if self.dangling is None:
148
+ # Compute this once and cache the answer
149
+ dangling = []
150
+ for n, f in self.fields.items():
151
+ for r in f.referenced_fields():
152
+ if r not in self.fields:
153
+ dangling.append(r)
154
+ self.dangling = dangling
155
+ return self.dangling
156
+
157
def output_fields(self, indent, lvalue_formatter):
158
+ # We use a topological sort to ensure that any use of NamedField
159
+ # comes after the initialization of the field it is referencing.
160
+ graph = {}
161
for n, f in self.fields.items():
162
- output(indent, lvalue_formatter(n), ' = ',
163
- f.str_extract(lvalue_formatter), ';\n')
164
+ refs = f.referenced_fields()
165
+ graph[n] = refs
166
+
167
+ try:
168
+ ts = TopologicalSorter(graph)
169
+ for n in ts.static_order():
170
+ # We only want to emit assignments for the keys
171
+ # in our fields list, not for anything that ends up
172
+ # in the tsort graph only because it was referenced as
173
+ # a NamedField.
174
+ try:
175
+ f = self.fields[n]
176
+ output(indent, lvalue_formatter(n), ' = ',
177
+ f.str_extract(lvalue_formatter), ';\n')
178
+ except KeyError:
179
+ pass
180
+ except CycleError as e:
181
+ # The second element of args is a list of nodes which form
182
+ # a cycle (there might be others too, but only one is reported).
183
+ # Pretty-print it to tell the user.
184
+ cycle = ' => '.join(e.args[1])
185
+ error(self.lineno, 'field definitions form a cycle: ' + cycle)
186
# end General
187
188
189
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
190
ind = str_indent(i)
191
arg = self.base.base.name
192
output(ind, '/* ', self.file, ':', str(self.lineno), ' */\n')
193
+ # We might have named references in the format that refer to fields
194
+ # in the pattern, or named references in the pattern that refer
195
+ # to fields in the format. This affects whether we extract the fields
196
+ # for the format before or after the ones for the pattern.
197
+ # For simplicity we don't allow cross references in both directions.
198
+ # This is also where we catch the syntax error of referring to
199
+ # a nonexistent field.
200
+ fmt_refs = self.base.dangling_references()
201
+ for r in fmt_refs:
202
+ if r not in self.fields:
203
+ error(self.lineno, f'format refers to undefined field {r}')
204
+ pat_refs = self.dangling_references()
205
+ for r in pat_refs:
206
+ if r not in self.base.fields:
207
+ error(self.lineno, f'pattern refers to undefined field {r}')
208
+ if pat_refs and fmt_refs:
209
+ error(self.lineno, ('pattern that uses fields defined in format '
210
+ 'cannot use format that uses fields defined '
211
+ 'in pattern'))
212
+ if fmt_refs:
213
+ # pattern fields first
214
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
215
+ assert not extracted, "dangling fmt refs but it was already extracted"
216
if not extracted:
217
output(ind, self.base.extract_name(),
218
'(ctx, &u.f_', arg, ', insn);\n')
219
- self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
220
+ if not fmt_refs:
221
+ # pattern fields last
222
+ self.output_fields(ind, lambda n: 'u.f_' + arg + '.' + n)
223
+
224
output(ind, 'if (', translate_prefix, '_', self.name,
225
'(ctx, &u.f_', arg, ')) return true;\n')
226
227
@@ -XXX,XX +XXX,XX @@ def output_code(self, i, extracted, outerbits, outermask):
228
ind = str_indent(i)
229
230
# If we identified all nodes below have the same format,
231
- # extract the fields now.
232
- if not extracted and self.base:
233
+ # extract the fields now. But don't do it if the format relies
234
+ # on named fields from the insn pattern, as those won't have
235
+ # been initialised at this point.
236
+ if not extracted and self.base and not self.base.dangling_references():
237
output(ind, self.base.extract_name(),
238
'(ctx, &u.f_', self.base.base.name, ', insn);\n')
239
extracted = True
240
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
241
"""Parse one instruction field from TOKS at LINENO"""
242
global fields
243
global insnwidth
244
+ global re_C_ident
245
246
# A "simple" field will have only one entry;
247
# a "multifield" will have several.
248
@@ -XXX,XX +XXX,XX @@ def parse_field(lineno, name, toks):
249
func = func[1]
250
continue
251
252
+ if re.fullmatch(re_C_ident + ':s[0-9]+', t):
253
+ # Signed named field
254
+ subtoks = t.split(':')
255
+ n = subtoks[0]
256
+ le = int(subtoks[1])
257
+ f = NamedField(n, True, le)
258
+ subs.append(f)
259
+ width += le
260
+ continue
261
+ if re.fullmatch(re_C_ident + ':[0-9]+', t):
262
+ # Unsigned named field
263
+ subtoks = t.split(':')
264
+ n = subtoks[0]
265
+ le = int(subtoks[1])
266
+ f = NamedField(n, False, le)
267
+ subs.append(f)
268
+ width += le
269
+ continue
270
+
271
if re.fullmatch('[0-9]+:s[0-9]+', t):
272
# Signed field extract
273
subtoks = t.split(':s')
32
--
274
--
33
2.34.1
275
2.34.1
diff view generated by jsdifflib
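[Editor's note: the key difference between a NamedField and an ordinary field is where the bits come from. A condensed Python mirror of NamedField.str_extract() above, with illustrative names:]

    def named_field_extract(name, sign, length, lvalue_formatter, bitop_width=32):
        # the value is re-extracted from the already-assigned field 'name',
        # not from the instruction word
        s = 's' if sign else ''
        return f'{s}extract{bitop_width}({lvalue_formatter(name)}, 0, {length})'

    print(named_field_extract('sz', False, 3, lambda n: 'a->' + n))
    # -> extract32(a->sz, 0, 3)
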
1
Use uint64_t for the pc, and size_t for the size.
1
From: Peter Maydell <peter.maydell@linaro.org>
2
2
3
Reviewed-by: Thomas Huth <thuth@redhat.com>
3
Add some tests for various cases of named-field use, both ones that
4
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
4
should work and ones that should be diagnosed as errors.
5
Message-Id: <20230503072331.1747057-81-richard.henderson@linaro.org>
5
6
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
7
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
8
Message-Id: <20230523120447.728365-7-peter.maydell@linaro.org>
6
---
9
---
7
include/disas/disas.h | 17 ++++++-----------
10
tests/decode/err_field10.decode | 7 +++++++
8
bsd-user/elfload.c | 5 +++--
11
tests/decode/err_field7.decode | 7 +++++++
9
disas/disas.c | 19 +++++++++----------
12
tests/decode/err_field8.decode | 8 ++++++++
10
linux-user/elfload.c | 5 +++--
13
tests/decode/err_field9.decode | 14 ++++++++++++++
11
4 files changed, 21 insertions(+), 25 deletions(-)
14
tests/decode/succ_named_field.decode | 19 +++++++++++++++++++
15
tests/decode/meson.build | 5 +++++
16
6 files changed, 60 insertions(+)
17
create mode 100644 tests/decode/err_field10.decode
18
create mode 100644 tests/decode/err_field7.decode
19
create mode 100644 tests/decode/err_field8.decode
20
create mode 100644 tests/decode/err_field9.decode
21
create mode 100644 tests/decode/succ_named_field.decode
12
22
13
diff --git a/include/disas/disas.h b/include/disas/disas.h
23
diff --git a/tests/decode/err_field10.decode b/tests/decode/err_field10.decode
24
new file mode 100644
25
index XXXXXXX..XXXXXXX
26
--- /dev/null
27
+++ b/tests/decode/err_field10.decode
28
@@ -XXX,XX +XXX,XX @@
29
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
30
+# See the COPYING.LIB file in the top-level directory.
31
+
32
+# Diagnose formats which refer to undefined fields
33
+%field1 field2:3
34
+@fmt ........ ........ ........ ........ %field1
35
+insn 00000000 00000000 00000000 00000000 @fmt
36
diff --git a/tests/decode/err_field7.decode b/tests/decode/err_field7.decode
37
new file mode 100644
38
index XXXXXXX..XXXXXXX
39
--- /dev/null
40
+++ b/tests/decode/err_field7.decode
41
@@ -XXX,XX +XXX,XX @@
42
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
43
+# See the COPYING.LIB file in the top-level directory.
44
+
45
+# Diagnose fields whose definitions form a loop
46
+%field1 field2:3
47
+%field2 field1:4
48
+insn 00000000 00000000 00000000 00000000 %field1 %field2
49
diff --git a/tests/decode/err_field8.decode b/tests/decode/err_field8.decode
50
new file mode 100644
51
index XXXXXXX..XXXXXXX
52
--- /dev/null
53
+++ b/tests/decode/err_field8.decode
54
@@ -XXX,XX +XXX,XX @@
55
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
56
+# See the COPYING.LIB file in the top-level directory.
57
+
58
+# Diagnose patterns which refer to undefined fields
59
+&f1 f1 a
60
+%field1 field2:3
61
+@fmt ........ ........ ........ .... a:4 &f1
62
+insn 00000000 00000000 00000000 0000 .... @fmt f1=%field1
63
diff --git a/tests/decode/err_field9.decode b/tests/decode/err_field9.decode
64
new file mode 100644
65
index XXXXXXX..XXXXXXX
66
--- /dev/null
67
+++ b/tests/decode/err_field9.decode
68
@@ -XXX,XX +XXX,XX @@
69
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
70
+# See the COPYING.LIB file in the top-level directory.
71
+
72
+# Diagnose fields where the format refers to a field defined in the
73
+# pattern and the pattern refers to a field defined in the format.
74
+# This is theoretically not impossible to implement, but is not
75
+# supported by the script at this time.
76
+&abcd a b c d
77
+%refa a:3
78
+%refc c:4
79
+# Format defines 'c' and sets 'b' to an indirect ref to 'a'
80
+@fmt ........ ........ ........ c:8 &abcd b=%refa
81
+# Pattern defines 'a' and sets 'd' to an indirect ref to 'c'
82
+insn 00000000 00000000 00000000 ........ @fmt d=%refc a=6
83
diff --git a/tests/decode/succ_named_field.decode b/tests/decode/succ_named_field.decode
84
new file mode 100644
85
index XXXXXXX..XXXXXXX
86
--- /dev/null
87
+++ b/tests/decode/succ_named_field.decode
88
@@ -XXX,XX +XXX,XX @@
89
+# This work is licensed under the terms of the GNU LGPL, version 2 or later.
90
+# See the COPYING.LIB file in the top-level directory.
91
+
92
+# field using a named_field
93
+%imm_sz    8:8 sz:3
94
+insn 00000000 00000000 ........ 00000000 imm_sz=%imm_sz sz=1
95
+
96
+# Ditto, via a format. Here a field in the format
97
+# references a named field defined in the insn pattern:
98
+&imm_a imm alpha
99
+%foo 0:16 alpha:4
100
+@foo 00000001 ........ ........ ........ &imm_a imm=%foo
101
+i1 ........ 00000000 ........ ........ @foo alpha=1
102
+i2 ........ 00000001 ........ ........ @foo alpha=2
103
+
104
+# Here the named field is defined in the format and referenced
105
+# from the insn pattern:
106
+@bar 00000010 ........ ........ ........ &imm_a alpha=4
107
+i3 ........ 00000000 ........ ........ @bar imm=%foo
108
diff --git a/tests/decode/meson.build b/tests/decode/meson.build
14
index XXXXXXX..XXXXXXX 100644
109
index XXXXXXX..XXXXXXX 100644
15
--- a/include/disas/disas.h
110
--- a/tests/decode/meson.build
16
+++ b/include/disas/disas.h
111
+++ b/tests/decode/meson.build
17
@@ -XXX,XX +XXX,XX @@
112
@@ -XXX,XX +XXX,XX @@ err_tests = [
18
#include "cpu.h"
113
'err_field4.decode',
19
114
'err_field5.decode',
20
/* Disassemble this for me please... (debugging). */
115
'err_field6.decode',
21
-void disas(FILE *out, const void *code, unsigned long size);
116
+ 'err_field7.decode',
22
-void target_disas(FILE *out, CPUState *cpu, target_ulong code,
117
+ 'err_field8.decode',
23
- target_ulong size);
118
+ 'err_field9.decode',
24
+void disas(FILE *out, const void *code, size_t size);
119
+ 'err_field10.decode',
25
+void target_disas(FILE *out, CPUState *cpu, uint64_t code, size_t size);
120
'err_init1.decode',
26
121
'err_init2.decode',
27
-void monitor_disas(Monitor *mon, CPUState *cpu,
122
'err_init3.decode',
28
- target_ulong pc, int nb_insn, int is_physical);
123
@@ -XXX,XX +XXX,XX @@ succ_tests = [
29
+void monitor_disas(Monitor *mon, CPUState *cpu, uint64_t pc,
124
'succ_argset_type1.decode',
30
+ int nb_insn, bool is_physical);
125
'succ_function.decode',
31
126
'succ_ident1.decode',
32
char *plugin_disas(CPUState *cpu, uint64_t addr, size_t size);
127
+ 'succ_named_field.decode',
33
128
'succ_pattern_group_nest1.decode',
34
/* Look up symbol for debugging purpose. Returns "" if unknown. */
129
'succ_pattern_group_nest2.decode',
35
-const char *lookup_symbol(target_ulong orig_addr);
130
'succ_pattern_group_nest3.decode',
36
+const char *lookup_symbol(uint64_t orig_addr);
37
#endif
38
39
struct syminfo;
40
struct elf32_sym;
41
struct elf64_sym;
42
43
-#if defined(CONFIG_USER_ONLY)
44
-typedef const char *(*lookup_symbol_t)(struct syminfo *s, target_ulong orig_addr);
45
-#else
46
-typedef const char *(*lookup_symbol_t)(struct syminfo *s, hwaddr orig_addr);
47
-#endif
48
+typedef const char *(*lookup_symbol_t)(struct syminfo *s, uint64_t orig_addr);
49
50
struct syminfo {
51
lookup_symbol_t lookup_symbol;
52
diff --git a/bsd-user/elfload.c b/bsd-user/elfload.c
53
index XXXXXXX..XXXXXXX 100644
54
--- a/bsd-user/elfload.c
55
+++ b/bsd-user/elfload.c
56
@@ -XXX,XX +XXX,XX @@ static abi_ulong load_elf_interp(struct elfhdr *interp_elf_ex,
57
58
static int symfind(const void *s0, const void *s1)
59
{
60
- target_ulong addr = *(target_ulong *)s0;
61
+ __typeof(sym->st_value) addr = *(uint64_t *)s0;
62
struct elf_sym *sym = (struct elf_sym *)s1;
63
int result = 0;
64
+
65
if (addr < sym->st_value) {
66
result = -1;
67
} else if (addr >= sym->st_value + sym->st_size) {
68
@@ -XXX,XX +XXX,XX @@ static int symfind(const void *s0, const void *s1)
69
return result;
70
}
71
72
-static const char *lookup_symbolxx(struct syminfo *s, target_ulong orig_addr)
73
+static const char *lookup_symbolxx(struct syminfo *s, uint64_t orig_addr)
74
{
75
#if ELF_CLASS == ELFCLASS32
76
struct elf_sym *syms = s->disas_symtab.elf32;
77
diff --git a/disas/disas.c b/disas/disas.c
78
index XXXXXXX..XXXXXXX 100644
79
--- a/disas/disas.c
80
+++ b/disas/disas.c
81
@@ -XXX,XX +XXX,XX @@ static void initialize_debug_host(CPUDebug *s)
82
}
83
84
/* Disassemble this for me please... (debugging). */
85
-void target_disas(FILE *out, CPUState *cpu, target_ulong code,
86
- target_ulong size)
87
+void target_disas(FILE *out, CPUState *cpu, uint64_t code, size_t size)
88
{
89
- target_ulong pc;
90
+ uint64_t pc;
91
int count;
92
CPUDebug s;
93
94
@@ -XXX,XX +XXX,XX @@ void target_disas(FILE *out, CPUState *cpu, target_ulong code,
95
}
96
97
for (pc = code; size > 0; pc += count, size -= count) {
98
- fprintf(out, "0x" TARGET_FMT_lx ": ", pc);
99
+ fprintf(out, "0x%08" PRIx64 ": ", pc);
100
count = s.info.print_insn(pc, &s.info);
101
fprintf(out, "\n");
102
if (count < 0) {
103
@@ -XXX,XX +XXX,XX @@ char *plugin_disas(CPUState *cpu, uint64_t addr, size_t size)
104
}
105
106
/* Disassemble this for me please... (debugging). */
107
-void disas(FILE *out, const void *code, unsigned long size)
108
+void disas(FILE *out, const void *code, size_t size)
109
{
110
uintptr_t pc;
111
int count;
112
@@ -XXX,XX +XXX,XX @@ void disas(FILE *out, const void *code, unsigned long size)
113
}
114
115
/* Look up symbol for debugging purpose. Returns "" if unknown. */
116
-const char *lookup_symbol(target_ulong orig_addr)
117
+const char *lookup_symbol(uint64_t orig_addr)
118
{
119
const char *symbol = "";
120
struct syminfo *s;
121
@@ -XXX,XX +XXX,XX @@ physical_read_memory(bfd_vma memaddr, bfd_byte *myaddr, int length,
122
}
123
124
/* Disassembler for the monitor. */
125
-void monitor_disas(Monitor *mon, CPUState *cpu,
126
- target_ulong pc, int nb_insn, int is_physical)
127
+void monitor_disas(Monitor *mon, CPUState *cpu, uint64_t pc,
128
+ int nb_insn, bool is_physical)
129
{
130
int count, i;
131
CPUDebug s;
132
@@ -XXX,XX +XXX,XX @@ void monitor_disas(Monitor *mon, CPUState *cpu,
133
}
134
135
if (!s.info.print_insn) {
136
- monitor_printf(mon, "0x" TARGET_FMT_lx
137
+ monitor_printf(mon, "0x%08" PRIx64
138
": Asm output not supported on this arch\n", pc);
139
return;
140
}
141
142
for (i = 0; i < nb_insn; i++) {
143
- g_string_append_printf(ds, "0x" TARGET_FMT_lx ": ", pc);
144
+ g_string_append_printf(ds, "0x%08" PRIx64 ": ", pc);
145
count = s.info.print_insn(pc, &s.info);
146
g_string_append_c(ds, '\n');
147
if (count < 0) {
148
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
149
index XXXXXXX..XXXXXXX 100644
150
--- a/linux-user/elfload.c
151
+++ b/linux-user/elfload.c
152
@@ -XXX,XX +XXX,XX @@ static void load_elf_interp(const char *filename, struct image_info *info,
153
154
static int symfind(const void *s0, const void *s1)
155
{
156
- target_ulong addr = *(target_ulong *)s0;
157
struct elf_sym *sym = (struct elf_sym *)s1;
158
+ __typeof(sym->st_value) addr = *(uint64_t *)s0;
159
int result = 0;
160
+
161
if (addr < sym->st_value) {
162
result = -1;
163
} else if (addr >= sym->st_value + sym->st_size) {
164
@@ -XXX,XX +XXX,XX @@ static int symfind(const void *s0, const void *s1)
165
return result;
166
}
167
168
-static const char *lookup_symbolxx(struct syminfo *s, target_ulong orig_addr)
169
+static const char *lookup_symbolxx(struct syminfo *s, uint64_t orig_addr)
170
{
171
#if ELF_CLASS == ELFCLASS32
172
struct elf_sym *syms = s->disas_symtab.elf32;
173
--
131
--
174
2.34.1
132
2.34.1
diff view generated by jsdifflib
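[Editor's note: as a worked example of the first pattern in succ_named_field.decode above; the instruction word is made up, and the field semantics follow the documentation patch earlier in this series.]

    def extract(val, pos, length):
        return (val >> pos) & ((1 << length) - 1)

    insn = 0x0000ab00   # matches 'insn 00000000 00000000 ........ 00000000'
    sz = 1              # fixed by 'sz=1' in the pattern
    # '%imm_sz 8:8 sz:3': insn[15:8] forms the high bits and the named
    # field 'sz' the low three bits
    imm_sz = extract(insn, 8, 8) << 3 | extract(sz, 0, 3)
    assert imm_sz == 0x559
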
Deleted patch
1
From: Jamie Iles <quic_jiles@quicinc.com>
2
1
3
Expose qemu_cpu_list_lock globally so that we can use
4
WITH_QEMU_LOCK_GUARD and QEMU_LOCK_GUARD to simplify a few code paths
5
now and in future.
6
7
Signed-off-by: Jamie Iles <quic_jiles@quicinc.com>
8
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
9
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
10
Message-Id: <20230427020925.51003-2-quic_jiles@quicinc.com>
11
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
12
---
13
include/exec/cpu-common.h | 1 +
14
cpus-common.c | 2 +-
15
linux-user/elfload.c | 13 +++++++------
16
migration/dirtyrate.c | 26 +++++++++++++-------------
17
trace/control-target.c | 9 ++++-----
18
5 files changed, 26 insertions(+), 25 deletions(-)
19
20
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
21
index XXXXXXX..XXXXXXX 100644
22
--- a/include/exec/cpu-common.h
23
+++ b/include/exec/cpu-common.h
24
@@ -XXX,XX +XXX,XX @@ extern intptr_t qemu_host_page_mask;
25
#define REAL_HOST_PAGE_ALIGN(addr) ROUND_UP((addr), qemu_real_host_page_size())
26
27
/* The CPU list lock nests outside page_(un)lock or mmap_(un)lock */
28
+extern QemuMutex qemu_cpu_list_lock;
29
void qemu_init_cpu_list(void);
30
void cpu_list_lock(void);
31
void cpu_list_unlock(void);
32
diff --git a/cpus-common.c b/cpus-common.c
33
index XXXXXXX..XXXXXXX 100644
34
--- a/cpus-common.c
35
+++ b/cpus-common.c
36
@@ -XXX,XX +XXX,XX @@
37
#include "qemu/lockable.h"
38
#include "trace/trace-root.h"
39
40
-static QemuMutex qemu_cpu_list_lock;
41
+QemuMutex qemu_cpu_list_lock;
42
static QemuCond exclusive_cond;
43
static QemuCond exclusive_resume;
44
static QemuCond qemu_work_cond;
45
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
46
index XXXXXXX..XXXXXXX 100644
47
--- a/linux-user/elfload.c
48
+++ b/linux-user/elfload.c
49
@@ -XXX,XX +XXX,XX @@
50
#include "qemu/guest-random.h"
51
#include "qemu/units.h"
52
#include "qemu/selfmap.h"
53
+#include "qemu/lockable.h"
54
#include "qapi/error.h"
55
#include "qemu/error-report.h"
56
#include "target_signal.h"
57
@@ -XXX,XX +XXX,XX @@ static int fill_note_info(struct elf_note_info *info,
58
info->notes_size += note_size(&info->notes[i]);
59
60
/* read and fill status of all threads */
61
- cpu_list_lock();
62
- CPU_FOREACH(cpu) {
63
- if (cpu == thread_cpu) {
64
- continue;
65
+ WITH_QEMU_LOCK_GUARD(&qemu_cpu_list_lock) {
66
+ CPU_FOREACH(cpu) {
67
+ if (cpu == thread_cpu) {
68
+ continue;
69
+ }
70
+ fill_thread_info(info, cpu->env_ptr);
71
}
72
- fill_thread_info(info, cpu->env_ptr);
73
}
74
- cpu_list_unlock();
75
76
return (0);
77
}
78
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
79
index XXXXXXX..XXXXXXX 100644
80
--- a/migration/dirtyrate.c
81
+++ b/migration/dirtyrate.c
82
@@ -XXX,XX +XXX,XX @@ int64_t vcpu_calculate_dirtyrate(int64_t calc_time_ms,
83
retry:
84
init_time_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
85
86
- cpu_list_lock();
87
- gen_id = cpu_list_generation_id_get();
88
- records = vcpu_dirty_stat_alloc(stat);
89
- vcpu_dirty_stat_collect(stat, records, true);
90
- cpu_list_unlock();
91
+ WITH_QEMU_LOCK_GUARD(&qemu_cpu_list_lock) {
92
+ gen_id = cpu_list_generation_id_get();
93
+ records = vcpu_dirty_stat_alloc(stat);
94
+ vcpu_dirty_stat_collect(stat, records, true);
95
+ }
96
97
duration = dirty_stat_wait(calc_time_ms, init_time_ms);
98
99
global_dirty_log_sync(flag, one_shot);
100
101
- cpu_list_lock();
102
- if (gen_id != cpu_list_generation_id_get()) {
103
- g_free(records);
104
- g_free(stat->rates);
105
- cpu_list_unlock();
106
- goto retry;
107
+ WITH_QEMU_LOCK_GUARD(&qemu_cpu_list_lock) {
108
+ if (gen_id != cpu_list_generation_id_get()) {
109
+ g_free(records);
110
+ g_free(stat->rates);
111
+ cpu_list_unlock();
112
+ goto retry;
113
+ }
114
+ vcpu_dirty_stat_collect(stat, records, false);
115
}
116
- vcpu_dirty_stat_collect(stat, records, false);
117
- cpu_list_unlock();
118
119
for (i = 0; i < stat->nvcpu; i++) {
120
dirtyrate = do_calculate_dirtyrate(records[i], duration);
121
diff --git a/trace/control-target.c b/trace/control-target.c
122
index XXXXXXX..XXXXXXX 100644
123
--- a/trace/control-target.c
124
+++ b/trace/control-target.c
125
@@ -XXX,XX +XXX,XX @@
126
*/
127
128
#include "qemu/osdep.h"
129
+#include "qemu/lockable.h"
130
#include "cpu.h"
131
#include "trace/trace-root.h"
132
#include "trace/control.h"
133
@@ -XXX,XX +XXX,XX @@ static bool adding_first_cpu1(void)
134
135
static bool adding_first_cpu(void)
136
{
137
- bool res;
138
- cpu_list_lock();
139
- res = adding_first_cpu1();
140
- cpu_list_unlock();
141
- return res;
142
+ QEMU_LOCK_GUARD(&qemu_cpu_list_lock);
143
+
144
+ return adding_first_cpu1();
145
}
146
147
void trace_init_vcpu(CPUState *vcpu)
148
--
149
2.34.1
150
151
diff view generated by jsdifflib
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label,
2
tcg_out_test_alignment, and some code that lived in both
3
tcg_out_qemu_ld and tcg_out_qemu_st into one function
4
that returns HostAddress and TCGLabelQemuLdst structures.
5
1
6
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
9
tcg/i386/tcg-target.c.inc | 346 ++++++++++++++++----------------------
10
1 file changed, 145 insertions(+), 201 deletions(-)
11
12
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
13
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/i386/tcg-target.c.inc
15
+++ b/tcg/i386/tcg-target.c.inc
16
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
17
[MO_BEUQ] = helper_be_stq_mmu,
18
};
19
20
-/* Perform the TLB load and compare.
21
-
22
- Inputs:
23
- ADDRLO and ADDRHI contain the low and high part of the address.
24
-
25
- MEM_INDEX and S_BITS are the memory context and log2 size of the load.
26
-
27
- WHICH is the offset into the CPUTLBEntry structure of the slot to read.
28
- This should be offsetof addr_read or addr_write.
29
-
30
- Outputs:
31
- LABEL_PTRS is filled with 1 (32-bit addresses) or 2 (64-bit addresses)
32
- positions of the displacements of forward jumps to the TLB miss case.
33
-
34
- Second argument register is loaded with the low part of the address.
35
- In the TLB hit case, it has been adjusted as indicated by the TLB
36
- and so is a host address. In the TLB miss case, it continues to
37
- hold a guest address.
38
-
39
- First argument register is clobbered. */
40
-
41
-static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
42
- int mem_index, MemOp opc,
43
- tcg_insn_unit **label_ptr, int which)
44
-{
45
- TCGType ttype = TCG_TYPE_I32;
46
- TCGType tlbtype = TCG_TYPE_I32;
47
- int trexw = 0, hrexw = 0, tlbrexw = 0;
48
- unsigned a_bits = get_alignment_bits(opc);
49
- unsigned s_bits = opc & MO_SIZE;
50
- unsigned a_mask = (1 << a_bits) - 1;
51
- unsigned s_mask = (1 << s_bits) - 1;
52
- target_ulong tlb_mask;
53
-
54
- if (TCG_TARGET_REG_BITS == 64) {
55
- if (TARGET_LONG_BITS == 64) {
56
- ttype = TCG_TYPE_I64;
57
- trexw = P_REXW;
58
- }
59
- if (TCG_TYPE_PTR == TCG_TYPE_I64) {
60
- hrexw = P_REXW;
61
- if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
62
- tlbtype = TCG_TYPE_I64;
63
- tlbrexw = P_REXW;
64
- }
65
- }
66
- }
67
-
68
- tcg_out_mov(s, tlbtype, TCG_REG_L0, addrlo);
69
- tcg_out_shifti(s, SHIFT_SHR + tlbrexw, TCG_REG_L0,
70
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
71
-
72
- tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, TCG_REG_L0, TCG_AREG0,
73
- TLB_MASK_TABLE_OFS(mem_index) +
74
- offsetof(CPUTLBDescFast, mask));
75
-
76
- tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, TCG_REG_L0, TCG_AREG0,
77
- TLB_MASK_TABLE_OFS(mem_index) +
78
- offsetof(CPUTLBDescFast, table));
79
-
80
- /* If the required alignment is at least as large as the access, simply
81
- copy the address and mask. For lesser alignments, check that we don't
82
- cross pages for the complete access. */
83
- if (a_bits >= s_bits) {
84
- tcg_out_mov(s, ttype, TCG_REG_L1, addrlo);
85
- } else {
86
- tcg_out_modrm_offset(s, OPC_LEA + trexw, TCG_REG_L1,
87
- addrlo, s_mask - a_mask);
88
- }
89
- tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
90
- tgen_arithi(s, ARITH_AND + trexw, TCG_REG_L1, tlb_mask, 0);
91
-
92
- /* cmp 0(TCG_REG_L0), TCG_REG_L1 */
93
- tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw,
94
- TCG_REG_L1, TCG_REG_L0, which);
95
-
96
- /* Prepare for both the fast path add of the tlb addend, and the slow
97
- path function argument setup. */
98
- tcg_out_mov(s, ttype, TCG_REG_L1, addrlo);
99
-
100
- /* jne slow_path */
101
- tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
102
- label_ptr[0] = s->code_ptr;
103
- s->code_ptr += 4;
104
-
105
- if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
106
- /* cmp 4(TCG_REG_L0), addrhi */
107
- tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, TCG_REG_L0, which + 4);
108
-
109
- /* jne slow_path */
110
- tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
111
- label_ptr[1] = s->code_ptr;
112
- s->code_ptr += 4;
113
- }
114
-
115
- /* TLB Hit. */
116
-
117
- /* add addend(TCG_REG_L0), TCG_REG_L1 */
118
- tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, TCG_REG_L1, TCG_REG_L0,
119
- offsetof(CPUTLBEntry, addend));
120
-}
121
-
122
-/*
123
- * Record the context of a call to the out of line helper code for the slow path
124
- * for a load or store, so that we can later generate the correct helper code
125
- */
126
-static void add_qemu_ldst_label(TCGContext *s, bool is_ld,
127
- TCGType type, MemOpIdx oi,
128
- TCGReg datalo, TCGReg datahi,
129
- TCGReg addrlo, TCGReg addrhi,
130
- tcg_insn_unit *raddr,
131
- tcg_insn_unit **label_ptr)
132
-{
133
- TCGLabelQemuLdst *label = new_ldst_label(s);
134
-
135
- label->is_ld = is_ld;
136
- label->oi = oi;
137
- label->type = type;
138
- label->datalo_reg = datalo;
139
- label->datahi_reg = datahi;
140
- label->addrlo_reg = addrlo;
141
- label->addrhi_reg = addrhi;
142
- label->raddr = tcg_splitwx_to_rx(raddr);
143
- label->label_ptr[0] = label_ptr[0];
144
- if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
145
- label->label_ptr[1] = label_ptr[1];
146
- }
147
-}
148
-
149
/*
150
* Generate code for the slow path for a load at the end of block
151
*/
152
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
153
return true;
154
}
155
#else
156
-
157
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
158
- TCGReg addrhi, unsigned a_bits)
159
-{
160
- unsigned a_mask = (1 << a_bits) - 1;
161
- TCGLabelQemuLdst *label;
162
-
163
- tcg_out_testi(s, addrlo, a_mask);
164
- /* jne slow_path */
165
- tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
166
-
167
- label = new_ldst_label(s);
168
- label->is_ld = is_ld;
169
- label->addrlo_reg = addrlo;
170
- label->addrhi_reg = addrhi;
171
- label->raddr = tcg_splitwx_to_rx(s->code_ptr + 4);
172
- label->label_ptr[0] = s->code_ptr;
173
-
174
- s->code_ptr += 4;
175
-}
176
-
177
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
178
{
179
/* resolve label address */
180
@@ -XXX,XX +XXX,XX @@ static inline int setup_guest_base_seg(void)
181
#endif /* setup_guest_base_seg */
182
#endif /* SOFTMMU */
183
184
+/*
185
+ * For softmmu, perform the TLB load and compare.
186
+ * For useronly, perform any required alignment tests.
187
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
188
+ * is required and fill in @h with the host address for the fast path.
189
+ */
190
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
191
+ TCGReg addrlo, TCGReg addrhi,
192
+ MemOpIdx oi, bool is_ld)
193
+{
194
+ TCGLabelQemuLdst *ldst = NULL;
195
+ MemOp opc = get_memop(oi);
196
+ unsigned a_bits = get_alignment_bits(opc);
197
+ unsigned a_mask = (1 << a_bits) - 1;
198
+
199
+#ifdef CONFIG_SOFTMMU
200
+ int cmp_ofs = is_ld ? offsetof(CPUTLBEntry, addr_read)
201
+ : offsetof(CPUTLBEntry, addr_write);
202
+ TCGType ttype = TCG_TYPE_I32;
203
+ TCGType tlbtype = TCG_TYPE_I32;
204
+ int trexw = 0, hrexw = 0, tlbrexw = 0;
205
+ unsigned mem_index = get_mmuidx(oi);
206
+ unsigned s_bits = opc & MO_SIZE;
207
+ unsigned s_mask = (1 << s_bits) - 1;
208
+ target_ulong tlb_mask;
209
+
210
+ ldst = new_ldst_label(s);
211
+ ldst->is_ld = is_ld;
212
+ ldst->oi = oi;
213
+ ldst->addrlo_reg = addrlo;
214
+ ldst->addrhi_reg = addrhi;
215
+
216
+ if (TCG_TARGET_REG_BITS == 64) {
217
+ if (TARGET_LONG_BITS == 64) {
218
+ ttype = TCG_TYPE_I64;
219
+ trexw = P_REXW;
220
+ }
221
+ if (TCG_TYPE_PTR == TCG_TYPE_I64) {
222
+ hrexw = P_REXW;
223
+ if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
224
+ tlbtype = TCG_TYPE_I64;
225
+ tlbrexw = P_REXW;
226
+ }
227
+ }
228
+ }
229
+
230
+ tcg_out_mov(s, tlbtype, TCG_REG_L0, addrlo);
231
+ tcg_out_shifti(s, SHIFT_SHR + tlbrexw, TCG_REG_L0,
232
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
233
+
234
+ tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, TCG_REG_L0, TCG_AREG0,
235
+ TLB_MASK_TABLE_OFS(mem_index) +
236
+ offsetof(CPUTLBDescFast, mask));
237
+
238
+ tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, TCG_REG_L0, TCG_AREG0,
239
+ TLB_MASK_TABLE_OFS(mem_index) +
240
+ offsetof(CPUTLBDescFast, table));
241
+
242
+ /*
243
+ * If the required alignment is at least as large as the access, simply
244
+ * copy the address and mask. For lesser alignments, check that we don't
245
+ * cross pages for the complete access.
246
+ */
247
+ if (a_bits >= s_bits) {
248
+ tcg_out_mov(s, ttype, TCG_REG_L1, addrlo);
249
+ } else {
250
+ tcg_out_modrm_offset(s, OPC_LEA + trexw, TCG_REG_L1,
251
+ addrlo, s_mask - a_mask);
252
+ }
253
+ tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
254
+ tgen_arithi(s, ARITH_AND + trexw, TCG_REG_L1, tlb_mask, 0);
255
+
256
+ /* cmp 0(TCG_REG_L0), TCG_REG_L1 */
257
+ tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw,
258
+ TCG_REG_L1, TCG_REG_L0, cmp_ofs);
259
+
260
+ /*
261
+ * Prepare for both the fast path add of the tlb addend, and the slow
262
+ * path function argument setup.
263
+ */
264
+ *h = (HostAddress) {
265
+ .base = TCG_REG_L1,
266
+ .index = -1
267
+ };
268
+ tcg_out_mov(s, ttype, h->base, addrlo);
269
+
270
+ /* jne slow_path */
271
+ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
272
+ ldst->label_ptr[0] = s->code_ptr;
273
+ s->code_ptr += 4;
274
+
275
+ if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
276
+ /* cmp 4(TCG_REG_L0), addrhi */
277
+ tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, TCG_REG_L0, cmp_ofs + 4);
278
+
279
+ /* jne slow_path */
280
+ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
281
+ ldst->label_ptr[1] = s->code_ptr;
282
+ s->code_ptr += 4;
283
+ }
284
+
285
+ /* TLB Hit. */
286
+
287
+ /* add addend(TCG_REG_L0), TCG_REG_L1 */
288
+ tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, h->base, TCG_REG_L0,
289
+ offsetof(CPUTLBEntry, addend));
290
+#else
291
+ if (a_bits) {
292
+ ldst = new_ldst_label(s);
293
+
294
+ ldst->is_ld = is_ld;
295
+ ldst->oi = oi;
296
+ ldst->addrlo_reg = addrlo;
297
+ ldst->addrhi_reg = addrhi;
298
+
299
+ tcg_out_testi(s, addrlo, a_mask);
300
+ /* jne slow_path */
301
+ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
302
+ ldst->label_ptr[0] = s->code_ptr;
303
+ s->code_ptr += 4;
304
+ }
305
+
306
+ *h = x86_guest_base;
307
+ h->base = addrlo;
308
+#endif
309
+
310
+ return ldst;
311
+}
312
+
313
static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
314
HostAddress h, TCGType type, MemOp memop)
315
{
316
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
317
TCGReg addrlo, TCGReg addrhi,
318
MemOpIdx oi, TCGType data_type)
319
{
320
- MemOp opc = get_memop(oi);
321
+ TCGLabelQemuLdst *ldst;
322
HostAddress h;
323
324
-#if defined(CONFIG_SOFTMMU)
325
- tcg_insn_unit *label_ptr[2];
326
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, true);
327
+ tcg_out_qemu_ld_direct(s, datalo, datahi, h, data_type, get_memop(oi));
328
329
- tcg_out_tlb_load(s, addrlo, addrhi, get_mmuidx(oi), opc,
330
- label_ptr, offsetof(CPUTLBEntry, addr_read));
331
-
332
- /* TLB Hit. */
333
- h.base = TCG_REG_L1;
334
- h.index = -1;
335
- h.ofs = 0;
336
- h.seg = 0;
337
- tcg_out_qemu_ld_direct(s, datalo, datahi, h, data_type, opc);
338
-
339
- /* Record the current context of a load into ldst label */
340
- add_qemu_ldst_label(s, true, data_type, oi, datalo, datahi,
341
- addrlo, addrhi, s->code_ptr, label_ptr);
342
-#else
343
- unsigned a_bits = get_alignment_bits(opc);
344
- if (a_bits) {
345
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
346
+ if (ldst) {
347
+ ldst->type = data_type;
348
+ ldst->datalo_reg = datalo;
349
+ ldst->datahi_reg = datahi;
350
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
351
}
352
-
353
- h = x86_guest_base;
354
- h.base = addrlo;
355
- tcg_out_qemu_ld_direct(s, datalo, datahi, h, data_type, opc);
356
-#endif
357
}
358
359
static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
360
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
361
TCGReg addrlo, TCGReg addrhi,
362
MemOpIdx oi, TCGType data_type)
363
{
364
- MemOp opc = get_memop(oi);
365
+ TCGLabelQemuLdst *ldst;
366
HostAddress h;
367
368
-#if defined(CONFIG_SOFTMMU)
369
- tcg_insn_unit *label_ptr[2];
370
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, false);
371
+ tcg_out_qemu_st_direct(s, datalo, datahi, h, get_memop(oi));
372
373
- tcg_out_tlb_load(s, addrlo, addrhi, get_mmuidx(oi), opc,
374
- label_ptr, offsetof(CPUTLBEntry, addr_write));
375
-
376
- /* TLB Hit. */
377
- h.base = TCG_REG_L1;
378
- h.index = -1;
379
- h.ofs = 0;
380
- h.seg = 0;
381
- tcg_out_qemu_st_direct(s, datalo, datahi, h, opc);
382
-
383
- /* Record the current context of a store into ldst label */
384
- add_qemu_ldst_label(s, false, data_type, oi, datalo, datahi,
385
- addrlo, addrhi, s->code_ptr, label_ptr);
386
-#else
387
- unsigned a_bits = get_alignment_bits(opc);
388
- if (a_bits) {
389
- tcg_out_test_alignment(s, false, addrlo, addrhi, a_bits);
390
+ if (ldst) {
391
+ ldst->type = data_type;
392
+ ldst->datalo_reg = datalo;
393
+ ldst->datahi_reg = datahi;
394
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
395
}
396
-
397
- h = x86_guest_base;
398
- h.base = addrlo;
399
-
400
- tcg_out_qemu_st_direct(s, datalo, datahi, h, opc);
401
-#endif
402
}
403
404
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
405
--
406
2.34.1
407
408
diff view generated by jsdifflib
Deleted patch
1
Since tcg_out_{ld,st}_helper_args, the slow path no longer requires
2
the address argument to be set up by the tlb load sequence. Use a
3
plain load for the addend and indexed addressing with the original
4
input address register.
5
1
6
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
9
tcg/i386/tcg-target.c.inc | 25 ++++++++++---------------
10
1 file changed, 10 insertions(+), 15 deletions(-)
11
12
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
13
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/i386/tcg-target.c.inc
15
+++ b/tcg/i386/tcg-target.c.inc
16
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
17
tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)l->raddr, TCG_REG_ESP, ofs);
18
} else {
19
tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
20
- /* The second argument is already loaded with addrlo. */
21
+ tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
22
+ l->addrlo_reg);
23
tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], oi);
24
tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
25
(uintptr_t)l->raddr);
26
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
27
tcg_out_st(s, TCG_TYPE_PTR, retaddr, TCG_REG_ESP, ofs);
28
} else {
29
tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
30
- /* The second argument is already loaded with addrlo. */
31
+ tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
32
+ l->addrlo_reg);
33
tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
34
tcg_target_call_iarg_regs[2], l->datalo_reg);
35
tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3], oi);
36
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
37
tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw,
38
TCG_REG_L1, TCG_REG_L0, cmp_ofs);
39
40
- /*
41
- * Prepare for both the fast path add of the tlb addend, and the slow
42
- * path function argument setup.
43
- */
44
- *h = (HostAddress) {
45
- .base = TCG_REG_L1,
46
- .index = -1
47
- };
48
- tcg_out_mov(s, ttype, h->base, addrlo);
49
-
50
/* jne slow_path */
51
tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
52
ldst->label_ptr[0] = s->code_ptr;
53
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
54
}
55
56
/* TLB Hit. */
57
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_L0, TCG_REG_L0,
58
+ offsetof(CPUTLBEntry, addend));
59
60
- /* add addend(TCG_REG_L0), TCG_REG_L1 */
61
- tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, h->base, TCG_REG_L0,
62
- offsetof(CPUTLBEntry, addend));
63
+ *h = (HostAddress) {
64
+ .base = addrlo,
65
+ .index = TCG_REG_L0,
66
+ };
67
#else
68
if (a_bits) {
69
ldst = new_ldst_label(s);
70
--
71
2.34.1
72
73
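The i386 patch above stops copying the guest address into TCG_REG_L1; it loads the TLB addend into TCG_REG_L0 and uses base+index addressing with the original address register. Both sequences compute the same host address, as the tiny C model below shows (the addend value is made up for illustration).

#include <stdint.h>
#include <stdio.h>

/* Both sequences compute host = guest_addr + tlb_addend; the new one
 * does it as a single indexed access [addr_reg + addend_reg] instead of
 * first copying the address into a scratch register and adding in place. */
static uint64_t host_via_copy(uint64_t addend, uint64_t guest)
{
    uint64_t scratch = guest;    /* old: mov L1, addr */
    scratch += addend;           /* old: add addend(L0), L1 */
    return scratch;
}

static uint64_t host_indexed(uint64_t addend, uint64_t guest)
{
    return guest + addend;       /* new: access [addr + L0] directly */
}

int main(void)
{
    uint64_t addend = 0x7f0000000000ull - 0x400000ull;
    printf("%d\n", host_via_copy(addend, 0x401234u)
                   == host_indexed(addend, 0x401234u));   /* prints 1 */
    return 0;
}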
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
2
and some code that lived in both tcg_out_qemu_ld and tcg_out_qemu_st
3
into one function that returns HostAddress and TCGLabelQemuLdst structures.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/aarch64/tcg-target.c.inc | 313 +++++++++++++++--------------------
9
1 file changed, 133 insertions(+), 180 deletions(-)
10
11
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/aarch64/tcg-target.c.inc
14
+++ b/tcg/aarch64/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
16
tcg_out_goto(s, lb->raddr);
17
return true;
18
}
19
-
20
-static void add_qemu_ldst_label(TCGContext *s, bool is_ld, MemOpIdx oi,
21
- TCGType ext, TCGReg data_reg, TCGReg addr_reg,
22
- tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
23
-{
24
- TCGLabelQemuLdst *label = new_ldst_label(s);
25
-
26
- label->is_ld = is_ld;
27
- label->oi = oi;
28
- label->type = ext;
29
- label->datalo_reg = data_reg;
30
- label->addrlo_reg = addr_reg;
31
- label->raddr = tcg_splitwx_to_rx(raddr);
32
- label->label_ptr[0] = label_ptr;
33
-}
34
-
35
-/* We expect to use a 7-bit scaled negative offset from ENV. */
36
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
37
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
38
-
39
-/* These offsets are built into the LDP below. */
40
-QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
41
-QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
42
-
43
-/* Load and compare a TLB entry, emitting the conditional jump to the
44
- slow path for the failure case, which will be patched later when finalizing
45
- the slow path. Generated code returns the host addend in X1,
46
- clobbers X0,X2,X3,TMP. */
47
-static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg, MemOp opc,
48
- tcg_insn_unit **label_ptr, int mem_index,
49
- bool is_read)
50
-{
51
- unsigned a_bits = get_alignment_bits(opc);
52
- unsigned s_bits = opc & MO_SIZE;
53
- unsigned a_mask = (1u << a_bits) - 1;
54
- unsigned s_mask = (1u << s_bits) - 1;
55
- TCGReg x3;
56
- TCGType mask_type;
57
- uint64_t compare_mask;
58
-
59
- mask_type = (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32
60
- ? TCG_TYPE_I64 : TCG_TYPE_I32);
61
-
62
- /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}. */
63
- tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
64
- TLB_MASK_TABLE_OFS(mem_index), 1, 0);
65
-
66
- /* Extract the TLB index from the address into X0. */
67
- tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
68
- TCG_REG_X0, TCG_REG_X0, addr_reg,
69
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
70
-
71
- /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1. */
72
- tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
73
-
74
- /* Load the tlb comparator into X0, and the fast path addend into X1. */
75
- tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_X0, TCG_REG_X1, is_read
76
- ? offsetof(CPUTLBEntry, addr_read)
77
- : offsetof(CPUTLBEntry, addr_write));
78
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
79
- offsetof(CPUTLBEntry, addend));
80
-
81
- /* For aligned accesses, we check the first byte and include the alignment
82
- bits within the address. For unaligned access, we check that we don't
83
- cross pages using the address of the last byte of the access. */
84
- if (a_bits >= s_bits) {
85
- x3 = addr_reg;
86
- } else {
87
- tcg_out_insn(s, 3401, ADDI, TARGET_LONG_BITS == 64,
88
- TCG_REG_X3, addr_reg, s_mask - a_mask);
89
- x3 = TCG_REG_X3;
90
- }
91
- compare_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
92
-
93
- /* Store the page mask part of the address into X3. */
94
- tcg_out_logicali(s, I3404_ANDI, TARGET_LONG_BITS == 64,
95
- TCG_REG_X3, x3, compare_mask);
96
-
97
- /* Perform the address comparison. */
98
- tcg_out_cmp(s, TARGET_LONG_BITS == 64, TCG_REG_X0, TCG_REG_X3, 0);
99
-
100
- /* If not equal, we jump to the slow path. */
101
- *label_ptr = s->code_ptr;
102
- tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
103
-}
104
-
105
#else
106
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
107
- unsigned a_bits)
108
-{
109
- unsigned a_mask = (1 << a_bits) - 1;
110
- TCGLabelQemuLdst *label = new_ldst_label(s);
111
-
112
- label->is_ld = is_ld;
113
- label->addrlo_reg = addr_reg;
114
-
115
- /* tst addr, #mask */
116
- tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, a_mask);
117
-
118
- label->label_ptr[0] = s->code_ptr;
119
-
120
- /* b.ne slow_path */
121
- tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
122
-
123
- label->raddr = tcg_splitwx_to_rx(s->code_ptr);
124
-}
125
-
126
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
127
{
128
if (!reloc_pc19(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
129
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
130
}
131
#endif /* CONFIG_SOFTMMU */
132
133
+/*
134
+ * For softmmu, perform the TLB load and compare.
135
+ * For useronly, perform any required alignment tests.
136
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
137
+ * is required and fill in @h with the host address for the fast path.
138
+ */
139
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
140
+ TCGReg addr_reg, MemOpIdx oi,
141
+ bool is_ld)
142
+{
143
+ TCGType addr_type = TARGET_LONG_BITS == 64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
144
+ TCGLabelQemuLdst *ldst = NULL;
145
+ MemOp opc = get_memop(oi);
146
+ unsigned a_bits = get_alignment_bits(opc);
147
+ unsigned a_mask = (1u << a_bits) - 1;
148
+
149
+#ifdef CONFIG_SOFTMMU
150
+ unsigned s_bits = opc & MO_SIZE;
151
+ unsigned s_mask = (1u << s_bits) - 1;
152
+ unsigned mem_index = get_mmuidx(oi);
153
+ TCGReg x3;
154
+ TCGType mask_type;
155
+ uint64_t compare_mask;
156
+
157
+ ldst = new_ldst_label(s);
158
+ ldst->is_ld = is_ld;
159
+ ldst->oi = oi;
160
+ ldst->addrlo_reg = addr_reg;
161
+
162
+ mask_type = (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32
163
+ ? TCG_TYPE_I64 : TCG_TYPE_I32);
164
+
165
+ /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {x0,x1}. */
166
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
167
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -512);
168
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, mask) != 0);
169
+ QEMU_BUILD_BUG_ON(offsetof(CPUTLBDescFast, table) != 8);
170
+ tcg_out_insn(s, 3314, LDP, TCG_REG_X0, TCG_REG_X1, TCG_AREG0,
171
+ TLB_MASK_TABLE_OFS(mem_index), 1, 0);
172
+
173
+ /* Extract the TLB index from the address into X0. */
174
+ tcg_out_insn(s, 3502S, AND_LSR, mask_type == TCG_TYPE_I64,
175
+ TCG_REG_X0, TCG_REG_X0, addr_reg,
176
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
177
+
178
+ /* Add the tlb_table pointer, creating the CPUTLBEntry address into X1. */
179
+ tcg_out_insn(s, 3502, ADD, 1, TCG_REG_X1, TCG_REG_X1, TCG_REG_X0);
180
+
181
+ /* Load the tlb comparator into X0, and the fast path addend into X1. */
182
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_X0, TCG_REG_X1,
183
+ is_ld ? offsetof(CPUTLBEntry, addr_read)
184
+ : offsetof(CPUTLBEntry, addr_write));
185
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_X1, TCG_REG_X1,
186
+ offsetof(CPUTLBEntry, addend));
187
+
188
+ /*
189
+ * For aligned accesses, we check the first byte and include the alignment
190
+ * bits within the address. For unaligned access, we check that we don't
191
+ * cross pages using the address of the last byte of the access.
192
+ */
193
+ if (a_bits >= s_bits) {
194
+ x3 = addr_reg;
195
+ } else {
196
+ tcg_out_insn(s, 3401, ADDI, TARGET_LONG_BITS == 64,
197
+ TCG_REG_X3, addr_reg, s_mask - a_mask);
198
+ x3 = TCG_REG_X3;
199
+ }
200
+ compare_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
201
+
202
+ /* Store the page mask part of the address into X3. */
203
+ tcg_out_logicali(s, I3404_ANDI, TARGET_LONG_BITS == 64,
204
+ TCG_REG_X3, x3, compare_mask);
205
+
206
+ /* Perform the address comparison. */
207
+ tcg_out_cmp(s, TARGET_LONG_BITS == 64, TCG_REG_X0, TCG_REG_X3, 0);
208
+
209
+ /* If not equal, we jump to the slow path. */
210
+ ldst->label_ptr[0] = s->code_ptr;
211
+ tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
212
+
213
+ *h = (HostAddress){
214
+ .base = TCG_REG_X1,
215
+ .index = addr_reg,
216
+ .index_ext = addr_type
217
+ };
218
+#else
219
+ if (a_mask) {
220
+ ldst = new_ldst_label(s);
221
+
222
+ ldst->is_ld = is_ld;
223
+ ldst->oi = oi;
224
+ ldst->addrlo_reg = addr_reg;
225
+
226
+ /* tst addr, #mask */
227
+ tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, a_mask);
228
+
229
+ /* b.ne slow_path */
230
+ ldst->label_ptr[0] = s->code_ptr;
231
+ tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
232
+ }
233
+
234
+ if (USE_GUEST_BASE) {
235
+ *h = (HostAddress){
236
+ .base = TCG_REG_GUEST_BASE,
237
+ .index = addr_reg,
238
+ .index_ext = addr_type
239
+ };
240
+ } else {
241
+ *h = (HostAddress){
242
+ .base = addr_reg,
243
+ .index = TCG_REG_XZR,
244
+ .index_ext = TCG_TYPE_I64
245
+ };
246
+ }
247
+#endif
248
+
249
+ return ldst;
250
+}
251
+
252
static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp memop, TCGType ext,
253
TCGReg data_r, HostAddress h)
254
{
255
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp memop,
256
static void tcg_out_qemu_ld(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
257
MemOpIdx oi, TCGType data_type)
258
{
259
- MemOp memop = get_memop(oi);
260
- TCGType addr_type = TARGET_LONG_BITS == 64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
261
+ TCGLabelQemuLdst *ldst;
262
HostAddress h;
263
264
- /* Byte swapping is left to middle-end expansion. */
265
- tcg_debug_assert((memop & MO_BSWAP) == 0);
266
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, true);
267
+ tcg_out_qemu_ld_direct(s, get_memop(oi), data_type, data_reg, h);
268
269
-#ifdef CONFIG_SOFTMMU
270
- tcg_insn_unit *label_ptr;
271
-
272
- tcg_out_tlb_read(s, addr_reg, memop, &label_ptr, get_mmuidx(oi), 1);
273
-
274
- h = (HostAddress){
275
- .base = TCG_REG_X1,
276
- .index = addr_reg,
277
- .index_ext = addr_type
278
- };
279
- tcg_out_qemu_ld_direct(s, memop, data_type, data_reg, h);
280
-
281
- add_qemu_ldst_label(s, true, oi, data_type, data_reg, addr_reg,
282
- s->code_ptr, label_ptr);
283
-#else /* !CONFIG_SOFTMMU */
284
- unsigned a_bits = get_alignment_bits(memop);
285
- if (a_bits) {
286
- tcg_out_test_alignment(s, true, addr_reg, a_bits);
287
+ if (ldst) {
288
+ ldst->type = data_type;
289
+ ldst->datalo_reg = data_reg;
290
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
291
}
292
- if (USE_GUEST_BASE) {
293
- h = (HostAddress){
294
- .base = TCG_REG_GUEST_BASE,
295
- .index = addr_reg,
296
- .index_ext = addr_type
297
- };
298
- } else {
299
- h = (HostAddress){
300
- .base = addr_reg,
301
- .index = TCG_REG_XZR,
302
- .index_ext = TCG_TYPE_I64
303
- };
304
- }
305
- tcg_out_qemu_ld_direct(s, memop, data_type, data_reg, h);
306
-#endif /* CONFIG_SOFTMMU */
307
}
308
309
static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
310
MemOpIdx oi, TCGType data_type)
311
{
312
- MemOp memop = get_memop(oi);
313
- TCGType addr_type = TARGET_LONG_BITS == 64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
314
+ TCGLabelQemuLdst *ldst;
315
HostAddress h;
316
317
- /* Byte swapping is left to middle-end expansion. */
318
- tcg_debug_assert((memop & MO_BSWAP) == 0);
319
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, false);
320
+ tcg_out_qemu_st_direct(s, get_memop(oi), data_reg, h);
321
322
-#ifdef CONFIG_SOFTMMU
323
- tcg_insn_unit *label_ptr;
324
-
325
- tcg_out_tlb_read(s, addr_reg, memop, &label_ptr, get_mmuidx(oi), 0);
326
-
327
- h = (HostAddress){
328
- .base = TCG_REG_X1,
329
- .index = addr_reg,
330
- .index_ext = addr_type
331
- };
332
- tcg_out_qemu_st_direct(s, memop, data_reg, h);
333
-
334
- add_qemu_ldst_label(s, false, oi, data_type, data_reg, addr_reg,
335
- s->code_ptr, label_ptr);
336
-#else /* !CONFIG_SOFTMMU */
337
- unsigned a_bits = get_alignment_bits(memop);
338
- if (a_bits) {
339
- tcg_out_test_alignment(s, false, addr_reg, a_bits);
340
+ if (ldst) {
341
+ ldst->type = data_type;
342
+ ldst->datalo_reg = data_reg;
343
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
344
}
345
- if (USE_GUEST_BASE) {
346
- h = (HostAddress){
347
- .base = TCG_REG_GUEST_BASE,
348
- .index = addr_reg,
349
- .index_ext = addr_type
350
- };
351
- } else {
352
- h = (HostAddress){
353
- .base = addr_reg,
354
- .index = TCG_REG_XZR,
355
- .index_ext = TCG_TYPE_I64
356
- };
357
- }
358
- tcg_out_qemu_st_direct(s, memop, data_reg, h);
359
-#endif /* CONFIG_SOFTMMU */
360
}
361
362
static const tcg_insn_unit *tb_ret_addr;
363
--
364
2.34.1
365
366
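The LDP/AND_LSR/ADD sequence in the aarch64 diff above computes the CPUTLBEntry address as table + ((addr >> (PAGE_BITS - ENTRY_BITS)) & mask), where the mask is pre-scaled by the entry size. The standalone model below uses invented constants and a fake entry layout purely to show the index arithmetic.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_BITS  12
#define ENTRY_BITS 5                      /* log2(sizeof(FakeTLBEntry)) */

typedef struct {
    uint64_t comparator;
    uint64_t addend;
    uint8_t  pad[16];
} FakeTLBEntry;                           /* 32 bytes = 1 << ENTRY_BITS */

/* entry address = table + ((addr >> (PAGE_BITS - ENTRY_BITS)) & mask),
 * with mask equal to (table size - 1) pre-scaled by the entry size;
 * this is what the AND_LSR + ADD pair in the diff computes. */
static FakeTLBEntry *tlb_entry_for(FakeTLBEntry *table, uint64_t mask,
                                   uint64_t addr)
{
    uint64_t ofs = (addr >> (PAGE_BITS - ENTRY_BITS)) & mask;
    return (FakeTLBEntry *)((uintptr_t)table + ofs);
}

int main(void)
{
    enum { N = 8 };                        /* 8-entry direct-mapped TLB */
    FakeTLBEntry table[N];
    uint64_t mask = (N - 1) << ENTRY_BITS; /* pre-scaled index mask */

    memset(table, 0, sizeof(table));
    FakeTLBEntry *e = tlb_entry_for(table, mask, 0x7000);
    printf("index %td\n", e - table);      /* 0x7000 >> 12 = 7, mod 8 = 7 */
    return 0;
}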
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
2
tcg_out_zext_addr_if_32_bit, and some code that lived in both
3
tcg_out_qemu_ld and tcg_out_qemu_st into one function that returns
4
HostAddress and TCGLabelQemuLdst structures.
5
1
6
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
9
tcg/loongarch64/tcg-target.c.inc | 255 +++++++++++++------------------
10
1 file changed, 105 insertions(+), 150 deletions(-)
11
12
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
13
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/loongarch64/tcg-target.c.inc
15
+++ b/tcg/loongarch64/tcg-target.c.inc
16
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[4] = {
17
[MO_64] = helper_le_stq_mmu,
18
};
19
20
-/* We expect to use a 12-bit negative offset from ENV. */
21
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
22
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
23
-
24
static bool tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
25
{
26
tcg_out_opc_b(s, 0);
27
return reloc_br_sd10k16(s->code_ptr - 1, target);
28
}
29
30
-/*
31
- * Emits common code for TLB addend lookup, that eventually loads the
32
- * addend in TCG_REG_TMP2.
33
- */
34
-static void tcg_out_tlb_load(TCGContext *s, TCGReg addrl, MemOpIdx oi,
35
- tcg_insn_unit **label_ptr, bool is_load)
36
-{
37
- MemOp opc = get_memop(oi);
38
- unsigned s_bits = opc & MO_SIZE;
39
- unsigned a_bits = get_alignment_bits(opc);
40
- tcg_target_long compare_mask;
41
- int mem_index = get_mmuidx(oi);
42
- int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
43
- int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
44
- int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
45
-
46
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, TCG_AREG0, mask_ofs);
47
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_AREG0, table_ofs);
48
-
49
- tcg_out_opc_srli_d(s, TCG_REG_TMP2, addrl,
50
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
51
- tcg_out_opc_and(s, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
52
- tcg_out_opc_add_d(s, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
53
-
54
- /* Load the tlb comparator and the addend. */
55
- tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
56
- is_load ? offsetof(CPUTLBEntry, addr_read)
57
- : offsetof(CPUTLBEntry, addr_write));
58
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
59
- offsetof(CPUTLBEntry, addend));
60
-
61
- /* We don't support unaligned accesses. */
62
- if (a_bits < s_bits) {
63
- a_bits = s_bits;
64
- }
65
- /* Clear the non-page, non-alignment bits from the address. */
66
- compare_mask = (tcg_target_long)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
67
- tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
68
- tcg_out_opc_and(s, TCG_REG_TMP1, TCG_REG_TMP1, addrl);
69
-
70
- /* Compare masked address with the TLB entry. */
71
- label_ptr[0] = s->code_ptr;
72
- tcg_out_opc_bne(s, TCG_REG_TMP0, TCG_REG_TMP1, 0);
73
-
74
- /* TLB Hit - addend in TCG_REG_TMP2, ready for use. */
75
-}
76
-
77
-static void add_qemu_ldst_label(TCGContext *s, int is_ld, MemOpIdx oi,
78
- TCGType type,
79
- TCGReg datalo, TCGReg addrlo,
80
- void *raddr, tcg_insn_unit **label_ptr)
81
-{
82
- TCGLabelQemuLdst *label = new_ldst_label(s);
83
-
84
- label->is_ld = is_ld;
85
- label->oi = oi;
86
- label->type = type;
87
- label->datalo_reg = datalo;
88
- label->datahi_reg = 0; /* unused */
89
- label->addrlo_reg = addrlo;
90
- label->addrhi_reg = 0; /* unused */
91
- label->raddr = tcg_splitwx_to_rx(raddr);
92
- label->label_ptr[0] = label_ptr[0];
93
-}
94
-
95
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
96
{
97
MemOpIdx oi = l->oi;
98
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
99
return tcg_out_goto(s, l->raddr);
100
}
101
#else
102
-
103
-/*
104
- * Alignment helpers for user-mode emulation
105
- */
106
-
107
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
108
- unsigned a_bits)
109
-{
110
- TCGLabelQemuLdst *l = new_ldst_label(s);
111
-
112
- l->is_ld = is_ld;
113
- l->addrlo_reg = addr_reg;
114
-
115
- /*
116
- * Without micro-architecture details, we don't know which of bstrpick or
117
- * andi is faster, so use bstrpick as it's not constrained by imm field
118
- * width. (Not to say alignments >= 2^12 are going to happen any time
119
- * soon, though)
120
- */
121
- tcg_out_opc_bstrpick_d(s, TCG_REG_TMP1, addr_reg, 0, a_bits - 1);
122
-
123
- l->label_ptr[0] = s->code_ptr;
124
- tcg_out_opc_bne(s, TCG_REG_TMP1, TCG_REG_ZERO, 0);
125
-
126
- l->raddr = tcg_splitwx_to_rx(s->code_ptr);
127
-}
128
-
129
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
130
{
131
/* resolve label address */
132
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
133
134
#endif /* CONFIG_SOFTMMU */
135
136
-/*
137
- * `ext32u` the address register into the temp register given,
138
- * if target is 32-bit, no-op otherwise.
139
- *
140
- * Returns the address register ready for use with TLB addend.
141
- */
142
-static TCGReg tcg_out_zext_addr_if_32_bit(TCGContext *s,
143
- TCGReg addr, TCGReg tmp)
144
-{
145
- if (TARGET_LONG_BITS == 32) {
146
- tcg_out_ext32u(s, tmp, addr);
147
- return tmp;
148
- }
149
- return addr;
150
-}
151
-
152
typedef struct {
153
TCGReg base;
154
TCGReg index;
155
} HostAddress;
156
157
+/*
158
+ * For softmmu, perform the TLB load and compare.
159
+ * For useronly, perform any required alignment tests.
160
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
161
+ * is required and fill in @h with the host address for the fast path.
162
+ */
163
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
164
+ TCGReg addr_reg, MemOpIdx oi,
165
+ bool is_ld)
166
+{
167
+ TCGLabelQemuLdst *ldst = NULL;
168
+ MemOp opc = get_memop(oi);
169
+ unsigned a_bits = get_alignment_bits(opc);
170
+
171
+#ifdef CONFIG_SOFTMMU
172
+ unsigned s_bits = opc & MO_SIZE;
173
+ int mem_index = get_mmuidx(oi);
174
+ int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
175
+ int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
176
+ int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
177
+ tcg_target_long compare_mask;
178
+
179
+ ldst = new_ldst_label(s);
180
+ ldst->is_ld = is_ld;
181
+ ldst->oi = oi;
182
+ ldst->addrlo_reg = addr_reg;
183
+
184
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
185
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
186
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, TCG_AREG0, mask_ofs);
187
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_AREG0, table_ofs);
188
+
189
+ tcg_out_opc_srli_d(s, TCG_REG_TMP2, addr_reg,
190
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
191
+ tcg_out_opc_and(s, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
192
+ tcg_out_opc_add_d(s, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
193
+
194
+ /* Load the tlb comparator and the addend. */
195
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
196
+ is_ld ? offsetof(CPUTLBEntry, addr_read)
197
+ : offsetof(CPUTLBEntry, addr_write));
198
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
199
+ offsetof(CPUTLBEntry, addend));
200
+
201
+ /* We don't support unaligned accesses. */
202
+ if (a_bits < s_bits) {
203
+ a_bits = s_bits;
204
+ }
205
+ /* Clear the non-page, non-alignment bits from the address. */
206
+ compare_mask = (tcg_target_long)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
207
+ tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
208
+ tcg_out_opc_and(s, TCG_REG_TMP1, TCG_REG_TMP1, addr_reg);
209
+
210
+ /* Compare masked address with the TLB entry. */
211
+ ldst->label_ptr[0] = s->code_ptr;
212
+ tcg_out_opc_bne(s, TCG_REG_TMP0, TCG_REG_TMP1, 0);
213
+
214
+ h->index = TCG_REG_TMP2;
215
+#else
216
+ if (a_bits) {
217
+ ldst = new_ldst_label(s);
218
+
219
+ ldst->is_ld = is_ld;
220
+ ldst->oi = oi;
221
+ ldst->addrlo_reg = addr_reg;
222
+
223
+ /*
224
+ * Without micro-architecture details, we don't know which of
225
+ * bstrpick or andi is faster, so use bstrpick as it's not
226
+ * constrained by imm field width. Not to say alignments >= 2^12
227
+ * are going to happen any time soon.
228
+ */
229
+ tcg_out_opc_bstrpick_d(s, TCG_REG_TMP1, addr_reg, 0, a_bits - 1);
230
+
231
+ ldst->label_ptr[0] = s->code_ptr;
232
+ tcg_out_opc_bne(s, TCG_REG_TMP1, TCG_REG_ZERO, 0);
233
+ }
234
+
235
+ h->index = USE_GUEST_BASE ? TCG_GUEST_BASE_REG : TCG_REG_ZERO;
236
+#endif
237
+
238
+ if (TARGET_LONG_BITS == 32) {
239
+ h->base = TCG_REG_TMP0;
240
+ tcg_out_ext32u(s, h->base, addr_reg);
241
+ } else {
242
+ h->base = addr_reg;
243
+ }
244
+
245
+ return ldst;
246
+}
247
+
248
static void tcg_out_qemu_ld_indexed(TCGContext *s, MemOp opc, TCGType type,
249
TCGReg rd, HostAddress h)
250
{
251
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_indexed(TCGContext *s, MemOp opc, TCGType type,
252
static void tcg_out_qemu_ld(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
253
MemOpIdx oi, TCGType data_type)
254
{
255
- MemOp opc = get_memop(oi);
256
+ TCGLabelQemuLdst *ldst;
257
HostAddress h;
258
259
-#ifdef CONFIG_SOFTMMU
260
- tcg_insn_unit *label_ptr[1];
261
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, true);
262
+ tcg_out_qemu_ld_indexed(s, get_memop(oi), data_type, data_reg, h);
263
264
- tcg_out_tlb_load(s, addr_reg, oi, label_ptr, 1);
265
- h.index = TCG_REG_TMP2;
266
-#else
267
- unsigned a_bits = get_alignment_bits(opc);
268
- if (a_bits) {
269
- tcg_out_test_alignment(s, true, addr_reg, a_bits);
270
+ if (ldst) {
271
+ ldst->type = data_type;
272
+ ldst->datalo_reg = data_reg;
273
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
274
}
275
- h.index = USE_GUEST_BASE ? TCG_GUEST_BASE_REG : TCG_REG_ZERO;
276
-#endif
277
-
278
- h.base = tcg_out_zext_addr_if_32_bit(s, addr_reg, TCG_REG_TMP0);
279
- tcg_out_qemu_ld_indexed(s, opc, data_type, data_reg, h);
280
-
281
-#ifdef CONFIG_SOFTMMU
282
- add_qemu_ldst_label(s, true, oi, data_type, data_reg, addr_reg,
283
- s->code_ptr, label_ptr);
284
-#endif
285
}
286
287
static void tcg_out_qemu_st_indexed(TCGContext *s, MemOp opc,
288
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_indexed(TCGContext *s, MemOp opc,
289
static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
290
MemOpIdx oi, TCGType data_type)
291
{
292
- MemOp opc = get_memop(oi);
293
+ TCGLabelQemuLdst *ldst;
294
HostAddress h;
295
296
-#ifdef CONFIG_SOFTMMU
297
- tcg_insn_unit *label_ptr[1];
298
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, false);
299
+ tcg_out_qemu_st_indexed(s, get_memop(oi), data_reg, h);
300
301
- tcg_out_tlb_load(s, addr_reg, oi, label_ptr, 0);
302
- h.index = TCG_REG_TMP2;
303
-#else
304
- unsigned a_bits = get_alignment_bits(opc);
305
- if (a_bits) {
306
- tcg_out_test_alignment(s, false, addr_reg, a_bits);
307
+ if (ldst) {
308
+ ldst->type = data_type;
309
+ ldst->datalo_reg = data_reg;
310
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
311
}
312
- h.index = USE_GUEST_BASE ? TCG_GUEST_BASE_REG : TCG_REG_ZERO;
313
-#endif
314
-
315
- h.base = tcg_out_zext_addr_if_32_bit(s, addr_reg, TCG_REG_TMP0);
316
- tcg_out_qemu_st_indexed(s, opc, data_reg, h);
317
-
318
-#ifdef CONFIG_SOFTMMU
319
- add_qemu_ldst_label(s, false, oi, data_type, data_reg, addr_reg,
320
- s->code_ptr, label_ptr);
321
-#endif
322
}
323
324
/*
325
--
326
2.34.1
327
328
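The compare_mask carried into prepare_host_addr above, TARGET_PAGE_MASK | ((1 << a_bits) - 1), deliberately keeps the low alignment bits of the address, so an insufficiently aligned access can never match the TLB comparator and falls through to the slow path. A standalone illustration with invented page-size constants:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12
#define PAGE_MASK (~((uint64_t)(1u << PAGE_BITS) - 1))

/* The comparator stores the page number; preserving the alignment bits
 * in the masked address forces a mismatch for unaligned accesses. */
static bool tlb_fast_hit(uint64_t addr, uint64_t tlb_comparator,
                         unsigned a_bits)
{
    uint64_t compare_mask = PAGE_MASK | ((1u << a_bits) - 1);
    return (addr & compare_mask) == tlb_comparator;
}

int main(void)
{
    uint64_t page = 0x12345000u;          /* comparator holds the page */
    printf("%d\n", tlb_fast_hit(page + 0x10, page, 2));  /* aligned: 1 */
    printf("%d\n", tlb_fast_hit(page + 0x11, page, 2));  /* unaligned: 0 */
    return 0;
}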
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
2
and some code that lived in both tcg_out_qemu_ld and tcg_out_qemu_st
3
into one function that returns HostAddress and TCGLabelQemuLdst structures.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/ppc/tcg-target.c.inc | 381 ++++++++++++++++++---------------------
9
1 file changed, 172 insertions(+), 209 deletions(-)
10
11
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.c.inc
14
+++ b/tcg/ppc/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
16
[MO_BEUQ] = helper_be_stq_mmu,
17
};
18
19
-/* We expect to use a 16-bit negative offset from ENV. */
20
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
21
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -32768);
22
-
23
-/* Perform the TLB load and compare. Places the result of the comparison
24
- in CR7, loads the addend of the TLB into R3, and returns the register
25
- containing the guest address (zero-extended into R4). Clobbers R0 and R2. */
26
-
27
-static TCGReg tcg_out_tlb_read(TCGContext *s, MemOp opc,
28
- TCGReg addrlo, TCGReg addrhi,
29
- int mem_index, bool is_read)
30
-{
31
- int cmp_off
32
- = (is_read
33
- ? offsetof(CPUTLBEntry, addr_read)
34
- : offsetof(CPUTLBEntry, addr_write));
35
- int fast_off = TLB_MASK_TABLE_OFS(mem_index);
36
- int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
37
- int table_off = fast_off + offsetof(CPUTLBDescFast, table);
38
- unsigned s_bits = opc & MO_SIZE;
39
- unsigned a_bits = get_alignment_bits(opc);
40
-
41
- /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
42
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R3, TCG_AREG0, mask_off);
43
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R4, TCG_AREG0, table_off);
44
-
45
- /* Extract the page index, shifted into place for tlb index. */
46
- if (TCG_TARGET_REG_BITS == 32) {
47
- tcg_out_shri32(s, TCG_REG_TMP1, addrlo,
48
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
49
- } else {
50
- tcg_out_shri64(s, TCG_REG_TMP1, addrlo,
51
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
52
- }
53
- tcg_out32(s, AND | SAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_TMP1));
54
-
55
- /* Load the TLB comparator. */
56
- if (cmp_off == 0 && TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
57
- uint32_t lxu = (TCG_TARGET_REG_BITS == 32 || TARGET_LONG_BITS == 32
58
- ? LWZUX : LDUX);
59
- tcg_out32(s, lxu | TAB(TCG_REG_TMP1, TCG_REG_R3, TCG_REG_R4));
60
- } else {
61
- tcg_out32(s, ADD | TAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_R4));
62
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
63
- tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_TMP1, TCG_REG_R3, cmp_off + 4);
64
- tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_R4, TCG_REG_R3, cmp_off);
65
- } else {
66
- tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP1, TCG_REG_R3, cmp_off);
67
- }
68
- }
69
-
70
- /* Load the TLB addend for use on the fast path. Do this asap
71
- to minimize any load use delay. */
72
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R3, TCG_REG_R3,
73
- offsetof(CPUTLBEntry, addend));
74
-
75
- /* Clear the non-page, non-alignment bits from the address */
76
- if (TCG_TARGET_REG_BITS == 32) {
77
- /* We don't support unaligned accesses on 32-bits.
78
- * Preserve the bottom bits and thus trigger a comparison
79
- * failure on unaligned accesses.
80
- */
81
- if (a_bits < s_bits) {
82
- a_bits = s_bits;
83
- }
84
- tcg_out_rlw(s, RLWINM, TCG_REG_R0, addrlo, 0,
85
- (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
86
- } else {
87
- TCGReg t = addrlo;
88
-
89
- /* If the access is unaligned, we need to make sure we fail if we
90
- * cross a page boundary. The trick is to add the access size-1
91
- * to the address before masking the low bits. That will make the
92
- * address overflow to the next page if we cross a page boundary,
93
- * which will then force a mismatch of the TLB compare.
94
- */
95
- if (a_bits < s_bits) {
96
- unsigned a_mask = (1 << a_bits) - 1;
97
- unsigned s_mask = (1 << s_bits) - 1;
98
- tcg_out32(s, ADDI | TAI(TCG_REG_R0, t, s_mask - a_mask));
99
- t = TCG_REG_R0;
100
- }
101
-
102
- /* Mask the address for the requested alignment. */
103
- if (TARGET_LONG_BITS == 32) {
104
- tcg_out_rlw(s, RLWINM, TCG_REG_R0, t, 0,
105
- (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
106
- /* Zero-extend the address for use in the final address. */
107
- tcg_out_ext32u(s, TCG_REG_R4, addrlo);
108
- addrlo = TCG_REG_R4;
109
- } else if (a_bits == 0) {
110
- tcg_out_rld(s, RLDICR, TCG_REG_R0, t, 0, 63 - TARGET_PAGE_BITS);
111
- } else {
112
- tcg_out_rld(s, RLDICL, TCG_REG_R0, t,
113
- 64 - TARGET_PAGE_BITS, TARGET_PAGE_BITS - a_bits);
114
- tcg_out_rld(s, RLDICL, TCG_REG_R0, TCG_REG_R0, TARGET_PAGE_BITS, 0);
115
- }
116
- }
117
-
118
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
119
- tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
120
- 0, 7, TCG_TYPE_I32);
121
- tcg_out_cmp(s, TCG_COND_EQ, addrhi, TCG_REG_R4, 0, 6, TCG_TYPE_I32);
122
- tcg_out32(s, CRAND | BT(7, CR_EQ) | BA(6, CR_EQ) | BB(7, CR_EQ));
123
- } else {
124
- tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
125
- 0, 7, TCG_TYPE_TL);
126
- }
127
-
128
- return addrlo;
129
-}
130
-
131
-/* Record the context of a call to the out of line helper code for the slow
132
- path for a load or store, so that we can later generate the correct
133
- helper code. */
134
-static void add_qemu_ldst_label(TCGContext *s, bool is_ld,
135
- TCGType type, MemOpIdx oi,
136
- TCGReg datalo_reg, TCGReg datahi_reg,
137
- TCGReg addrlo_reg, TCGReg addrhi_reg,
138
- tcg_insn_unit *raddr, tcg_insn_unit *lptr)
139
-{
140
- TCGLabelQemuLdst *label = new_ldst_label(s);
141
-
142
- label->is_ld = is_ld;
143
- label->type = type;
144
- label->oi = oi;
145
- label->datalo_reg = datalo_reg;
146
- label->datahi_reg = datahi_reg;
147
- label->addrlo_reg = addrlo_reg;
148
- label->addrhi_reg = addrhi_reg;
149
- label->raddr = tcg_splitwx_to_rx(raddr);
150
- label->label_ptr[0] = lptr;
151
-}
152
-
153
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
154
{
155
MemOpIdx oi = lb->oi;
156
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
157
return true;
158
}
159
#else
160
-
161
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
162
- TCGReg addrhi, unsigned a_bits)
163
-{
164
- unsigned a_mask = (1 << a_bits) - 1;
165
- TCGLabelQemuLdst *label = new_ldst_label(s);
166
-
167
- label->is_ld = is_ld;
168
- label->addrlo_reg = addrlo;
169
- label->addrhi_reg = addrhi;
170
-
171
- /* We are expecting a_bits to max out at 7, much lower than ANDI. */
172
- tcg_debug_assert(a_bits < 16);
173
- tcg_out32(s, ANDI | SAI(addrlo, TCG_REG_R0, a_mask));
174
-
175
- label->label_ptr[0] = s->code_ptr;
176
- tcg_out32(s, BC | BI(0, CR_EQ) | BO_COND_FALSE | LK);
177
-
178
- label->raddr = tcg_splitwx_to_rx(s->code_ptr);
179
-}
180
-
181
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
182
{
183
if (!reloc_pc14(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
184
@@ -XXX,XX +XXX,XX @@ typedef struct {
185
TCGReg index;
186
} HostAddress;
187
188
+/*
189
+ * For softmmu, perform the TLB load and compare.
190
+ * For useronly, perform any required alignment tests.
191
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
192
+ * is required and fill in @h with the host address for the fast path.
193
+ */
194
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
195
+ TCGReg addrlo, TCGReg addrhi,
196
+ MemOpIdx oi, bool is_ld)
197
+{
198
+ TCGLabelQemuLdst *ldst = NULL;
199
+ MemOp opc = get_memop(oi);
200
+ unsigned a_bits = get_alignment_bits(opc);
201
+
202
+#ifdef CONFIG_SOFTMMU
203
+ int mem_index = get_mmuidx(oi);
204
+ int cmp_off = is_ld ? offsetof(CPUTLBEntry, addr_read)
205
+ : offsetof(CPUTLBEntry, addr_write);
206
+ int fast_off = TLB_MASK_TABLE_OFS(mem_index);
207
+ int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
208
+ int table_off = fast_off + offsetof(CPUTLBDescFast, table);
209
+ unsigned s_bits = opc & MO_SIZE;
210
+
211
+ ldst = new_ldst_label(s);
212
+ ldst->is_ld = is_ld;
213
+ ldst->oi = oi;
214
+ ldst->addrlo_reg = addrlo;
215
+ ldst->addrhi_reg = addrhi;
216
+
217
+ /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
218
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
219
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -32768);
220
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R3, TCG_AREG0, mask_off);
221
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R4, TCG_AREG0, table_off);
222
+
223
+ /* Extract the page index, shifted into place for tlb index. */
224
+ if (TCG_TARGET_REG_BITS == 32) {
225
+ tcg_out_shri32(s, TCG_REG_TMP1, addrlo,
226
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
227
+ } else {
228
+ tcg_out_shri64(s, TCG_REG_TMP1, addrlo,
229
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
230
+ }
231
+ tcg_out32(s, AND | SAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_TMP1));
232
+
233
+ /* Load the TLB comparator. */
234
+ if (cmp_off == 0 && TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
235
+ uint32_t lxu = (TCG_TARGET_REG_BITS == 32 || TARGET_LONG_BITS == 32
236
+ ? LWZUX : LDUX);
237
+ tcg_out32(s, lxu | TAB(TCG_REG_TMP1, TCG_REG_R3, TCG_REG_R4));
238
+ } else {
239
+ tcg_out32(s, ADD | TAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_R4));
240
+ if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
241
+ tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_TMP1, TCG_REG_R3, cmp_off + 4);
242
+ tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_R4, TCG_REG_R3, cmp_off);
243
+ } else {
244
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP1, TCG_REG_R3, cmp_off);
245
+ }
246
+ }
247
+
248
+ /*
249
+ * Load the TLB addend for use on the fast path.
250
+ * Do this asap to minimize any load use delay.
251
+ */
252
+ h->base = TCG_REG_R3;
253
+ tcg_out_ld(s, TCG_TYPE_PTR, h->base, TCG_REG_R3,
254
+ offsetof(CPUTLBEntry, addend));
255
+
256
+ /* Clear the non-page, non-alignment bits from the address */
257
+ if (TCG_TARGET_REG_BITS == 32) {
258
+ /*
259
+ * We don't support unaligned accesses on 32-bits.
260
+ * Preserve the bottom bits and thus trigger a comparison
261
+ * failure on unaligned accesses.
262
+ */
263
+ if (a_bits < s_bits) {
264
+ a_bits = s_bits;
265
+ }
266
+ tcg_out_rlw(s, RLWINM, TCG_REG_R0, addrlo, 0,
267
+ (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
268
+ } else {
269
+ TCGReg t = addrlo;
270
+
271
+ /*
272
+ * If the access is unaligned, we need to make sure we fail if we
273
+ * cross a page boundary. The trick is to add the access size-1
274
+ * to the address before masking the low bits. That will make the
275
+ * address overflow to the next page if we cross a page boundary,
276
+ * which will then force a mismatch of the TLB compare.
277
+ */
278
+ if (a_bits < s_bits) {
279
+ unsigned a_mask = (1 << a_bits) - 1;
280
+ unsigned s_mask = (1 << s_bits) - 1;
281
+ tcg_out32(s, ADDI | TAI(TCG_REG_R0, t, s_mask - a_mask));
282
+ t = TCG_REG_R0;
283
+ }
284
+
285
+ /* Mask the address for the requested alignment. */
286
+ if (TARGET_LONG_BITS == 32) {
287
+ tcg_out_rlw(s, RLWINM, TCG_REG_R0, t, 0,
288
+ (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
289
+ /* Zero-extend the address for use in the final address. */
290
+ tcg_out_ext32u(s, TCG_REG_R4, addrlo);
291
+ addrlo = TCG_REG_R4;
292
+ } else if (a_bits == 0) {
293
+ tcg_out_rld(s, RLDICR, TCG_REG_R0, t, 0, 63 - TARGET_PAGE_BITS);
294
+ } else {
295
+ tcg_out_rld(s, RLDICL, TCG_REG_R0, t,
296
+ 64 - TARGET_PAGE_BITS, TARGET_PAGE_BITS - a_bits);
297
+ tcg_out_rld(s, RLDICL, TCG_REG_R0, TCG_REG_R0, TARGET_PAGE_BITS, 0);
298
+ }
299
+ }
300
+ h->index = addrlo;
301
+
302
+ if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
303
+ tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
304
+ 0, 7, TCG_TYPE_I32);
305
+ tcg_out_cmp(s, TCG_COND_EQ, addrhi, TCG_REG_R4, 0, 6, TCG_TYPE_I32);
306
+ tcg_out32(s, CRAND | BT(7, CR_EQ) | BA(6, CR_EQ) | BB(7, CR_EQ));
307
+ } else {
308
+ tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
309
+ 0, 7, TCG_TYPE_TL);
310
+ }
311
+
312
+ /* Load a pointer into the current opcode w/conditional branch-link. */
313
+ ldst->label_ptr[0] = s->code_ptr;
314
+ tcg_out32(s, BC | BI(7, CR_EQ) | BO_COND_FALSE | LK);
315
+#else
316
+ if (a_bits) {
317
+ ldst = new_ldst_label(s);
318
+ ldst->is_ld = is_ld;
319
+ ldst->oi = oi;
320
+ ldst->addrlo_reg = addrlo;
321
+ ldst->addrhi_reg = addrhi;
322
+
323
+ /* We are expecting a_bits to max out at 7, much lower than ANDI. */
324
+ tcg_debug_assert(a_bits < 16);
325
+ tcg_out32(s, ANDI | SAI(addrlo, TCG_REG_R0, (1 << a_bits) - 1));
326
+
327
+ ldst->label_ptr[0] = s->code_ptr;
328
+ tcg_out32(s, BC | BI(0, CR_EQ) | BO_COND_FALSE | LK);
329
+ }
330
+
331
+ h->base = guest_base ? TCG_GUEST_BASE_REG : 0;
332
+ h->index = addrlo;
333
+ if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
334
+ tcg_out_ext32u(s, TCG_REG_TMP1, addrlo);
335
+ h->index = TCG_REG_TMP1;
336
+ }
337
+#endif
338
+
339
+ return ldst;
340
+}
341
+
342
static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
343
TCGReg addrlo, TCGReg addrhi,
344
MemOpIdx oi, TCGType data_type)
345
{
346
MemOp opc = get_memop(oi);
347
- MemOp s_bits = opc & MO_SIZE;
348
+ TCGLabelQemuLdst *ldst;
349
HostAddress h;
350
351
-#ifdef CONFIG_SOFTMMU
352
- tcg_insn_unit *label_ptr;
353
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, true);
354
355
- h.index = tcg_out_tlb_read(s, opc, addrlo, addrhi, get_mmuidx(oi), true);
356
- h.base = TCG_REG_R3;
357
-
358
- /* Load a pointer into the current opcode w/conditional branch-link. */
359
- label_ptr = s->code_ptr;
360
- tcg_out32(s, BC | BI(7, CR_EQ) | BO_COND_FALSE | LK);
361
-#else /* !CONFIG_SOFTMMU */
362
- unsigned a_bits = get_alignment_bits(opc);
363
- if (a_bits) {
364
- tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
365
- }
366
- h.base = guest_base ? TCG_GUEST_BASE_REG : 0;
367
- h.index = addrlo;
368
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
369
- tcg_out_ext32u(s, TCG_REG_TMP1, addrlo);
370
- h.index = TCG_REG_TMP1;
371
- }
372
-#endif
373
-
374
- if (TCG_TARGET_REG_BITS == 32 && s_bits == MO_64) {
375
+ if (TCG_TARGET_REG_BITS == 32 && (opc & MO_SIZE) == MO_64) {
376
if (opc & MO_BSWAP) {
377
tcg_out32(s, ADDI | TAI(TCG_REG_R0, h.index, 4));
378
tcg_out32(s, LWBRX | TAB(datalo, h.base, h.index));
379
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg datalo, TCGReg datahi,
380
}
381
}
382
383
-#ifdef CONFIG_SOFTMMU
384
- add_qemu_ldst_label(s, true, data_type, oi, datalo, datahi,
385
- addrlo, addrhi, s->code_ptr, label_ptr);
386
-#endif
387
+ if (ldst) {
388
+ ldst->type = data_type;
389
+ ldst->datalo_reg = datalo;
390
+ ldst->datahi_reg = datahi;
391
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
392
+ }
393
}
394
395
static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
396
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
397
MemOpIdx oi, TCGType data_type)
398
{
399
MemOp opc = get_memop(oi);
400
- MemOp s_bits = opc & MO_SIZE;
401
+ TCGLabelQemuLdst *ldst;
402
HostAddress h;
403
404
-#ifdef CONFIG_SOFTMMU
405
- tcg_insn_unit *label_ptr;
406
+ ldst = prepare_host_addr(s, &h, addrlo, addrhi, oi, false);
407
408
- h.index = tcg_out_tlb_read(s, opc, addrlo, addrhi, get_mmuidx(oi), false);
409
- h.base = TCG_REG_R3;
410
-
411
- /* Load a pointer into the current opcode w/conditional branch-link. */
412
- label_ptr = s->code_ptr;
413
- tcg_out32(s, BC | BI(7, CR_EQ) | BO_COND_FALSE | LK);
414
-#else /* !CONFIG_SOFTMMU */
415
- unsigned a_bits = get_alignment_bits(opc);
416
- if (a_bits) {
417
- tcg_out_test_alignment(s, false, addrlo, addrhi, a_bits);
418
- }
419
- h.base = guest_base ? TCG_GUEST_BASE_REG : 0;
420
- h.index = addrlo;
421
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
422
- tcg_out_ext32u(s, TCG_REG_TMP1, addrlo);
423
- h.index = TCG_REG_TMP1;
424
- }
425
-#endif
426
-
427
- if (TCG_TARGET_REG_BITS == 32 && s_bits == MO_64) {
428
+ if (TCG_TARGET_REG_BITS == 32 && (opc & MO_SIZE) == MO_64) {
429
if (opc & MO_BSWAP) {
430
tcg_out32(s, ADDI | TAI(TCG_REG_R0, h.index, 4));
431
tcg_out32(s, STWBRX | SAB(datalo, h.base, h.index));
432
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
433
}
434
}
435
436
-#ifdef CONFIG_SOFTMMU
437
- add_qemu_ldst_label(s, false, data_type, oi, datalo, datahi,
438
- addrlo, addrhi, s->code_ptr, label_ptr);
439
-#endif
440
+ if (ldst) {
441
+ ldst->type = data_type;
442
+ ldst->datalo_reg = datalo;
443
+ ldst->datahi_reg = datahi;
444
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
445
+ }
446
}
447
448
static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
449
--
450
2.34.1
451
452
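The ppc diff above keeps the trick of adding (access size - 1) to the address before masking: an access that straddles a page boundary then overflows into the next page number and fails the TLB compare. A minimal model of that check, with illustrative constants rather than the real TARGET_PAGE_BITS:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12
#define PAGE_MASK (~((uint64_t)(1u << PAGE_BITS) - 1))

static bool same_page_fast_path(uint64_t addr, unsigned s_bits,
                                unsigned a_bits)
{
    uint64_t probe = addr;
    if (a_bits < s_bits) {
        unsigned s_mask = (1u << s_bits) - 1;
        unsigned a_mask = (1u << a_bits) - 1;
        probe += s_mask - a_mask;          /* address of the last byte */
    }
    /* Compare the page number of the probe with the page of addr. */
    return (probe & PAGE_MASK) == (addr & PAGE_MASK);
}

int main(void)
{
    /* 8-byte access (s_bits = 3), byte-aligned (a_bits = 0). */
    printf("%d\n", same_page_fast_path(0x1ff0, 3, 0));  /* fits: 1 */
    printf("%d\n", same_page_fast_path(0x1ffc, 3, 0));  /* crosses: 0 */
    return 0;
}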
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
2
and some code that lived in both tcg_out_qemu_ld and tcg_out_qemu_st
3
into one function that returns TCGReg and TCGLabelQemuLdst.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/riscv/tcg-target.c.inc | 253 +++++++++++++++++--------------------
9
1 file changed, 114 insertions(+), 139 deletions(-)
10
11
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/riscv/tcg-target.c.inc
14
+++ b/tcg/riscv/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[MO_SIZE + 1] = {
16
#endif
17
};
18
19
-/* We expect to use a 12-bit negative offset from ENV. */
20
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
21
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
22
-
23
static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
24
{
25
tcg_out_opc_jump(s, OPC_JAL, TCG_REG_ZERO, 0);
26
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
27
tcg_debug_assert(ok);
28
}
29
30
-static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, MemOpIdx oi,
31
- tcg_insn_unit **label_ptr, bool is_load)
32
-{
33
- MemOp opc = get_memop(oi);
34
- unsigned s_bits = opc & MO_SIZE;
35
- unsigned a_bits = get_alignment_bits(opc);
36
- tcg_target_long compare_mask;
37
- int mem_index = get_mmuidx(oi);
38
- int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
39
- int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
40
- int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
41
- TCGReg mask_base = TCG_AREG0, table_base = TCG_AREG0;
42
-
43
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, mask_base, mask_ofs);
44
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, table_base, table_ofs);
45
-
46
- tcg_out_opc_imm(s, OPC_SRLI, TCG_REG_TMP2, addr,
47
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
48
- tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
49
- tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
50
-
51
- /* Load the tlb comparator and the addend. */
52
- tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
53
- is_load ? offsetof(CPUTLBEntry, addr_read)
54
- : offsetof(CPUTLBEntry, addr_write));
55
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
56
- offsetof(CPUTLBEntry, addend));
57
-
58
- /* We don't support unaligned accesses. */
59
- if (a_bits < s_bits) {
60
- a_bits = s_bits;
61
- }
62
- /* Clear the non-page, non-alignment bits from the address. */
63
- compare_mask = (tcg_target_long)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
64
- if (compare_mask == sextreg(compare_mask, 0, 12)) {
65
- tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr, compare_mask);
66
- } else {
67
- tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
68
- tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr);
69
- }
70
-
71
- /* Compare masked address with the TLB entry. */
72
- label_ptr[0] = s->code_ptr;
73
- tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP0, TCG_REG_TMP1, 0);
74
-
75
- /* TLB Hit - translate address using addend. */
76
- if (TARGET_LONG_BITS == 32) {
77
- tcg_out_ext32u(s, TCG_REG_TMP0, addr);
78
- addr = TCG_REG_TMP0;
79
- }
80
- tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr);
81
- return TCG_REG_TMP0;
82
-}
83
-
84
-static void add_qemu_ldst_label(TCGContext *s, int is_ld, MemOpIdx oi,
85
- TCGType data_type, TCGReg data_reg,
86
- TCGReg addr_reg, void *raddr,
87
- tcg_insn_unit **label_ptr)
88
-{
89
- TCGLabelQemuLdst *label = new_ldst_label(s);
90
-
91
- label->is_ld = is_ld;
92
- label->oi = oi;
93
- label->type = data_type;
94
- label->datalo_reg = data_reg;
95
- label->addrlo_reg = addr_reg;
96
- label->raddr = tcg_splitwx_to_rx(raddr);
97
- label->label_ptr[0] = label_ptr[0];
98
-}
99
-
100
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
101
{
102
MemOpIdx oi = l->oi;
103
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
104
return true;
105
}
106
#else
107
-
108
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
109
- unsigned a_bits)
110
-{
111
- unsigned a_mask = (1 << a_bits) - 1;
112
- TCGLabelQemuLdst *l = new_ldst_label(s);
113
-
114
- l->is_ld = is_ld;
115
- l->addrlo_reg = addr_reg;
116
-
117
- /* We are expecting a_bits to max out at 7, so we can always use andi. */
118
- tcg_debug_assert(a_bits < 12);
119
- tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, a_mask);
120
-
121
- l->label_ptr[0] = s->code_ptr;
122
- tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP1, TCG_REG_ZERO, 0);
123
-
124
- l->raddr = tcg_splitwx_to_rx(s->code_ptr);
125
-}
126
-
127
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
128
{
129
/* resolve label address */
130
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
131
{
132
return tcg_out_fail_alignment(s, l);
133
}
134
-
135
#endif /* CONFIG_SOFTMMU */
136
137
+/*
138
+ * For softmmu, perform the TLB load and compare.
139
+ * For useronly, perform any required alignment tests.
140
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
141
+ * is required and fill in @h with the host address for the fast path.
142
+ */
143
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
144
+ TCGReg addr_reg, MemOpIdx oi,
145
+ bool is_ld)
146
+{
147
+ TCGLabelQemuLdst *ldst = NULL;
148
+ MemOp opc = get_memop(oi);
149
+ unsigned a_bits = get_alignment_bits(opc);
150
+ unsigned a_mask = (1u << a_bits) - 1;
151
+
152
+#ifdef CONFIG_SOFTMMU
153
+ unsigned s_bits = opc & MO_SIZE;
154
+ int mem_index = get_mmuidx(oi);
155
+ int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
156
+ int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
157
+ int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
158
+ TCGReg mask_base = TCG_AREG0, table_base = TCG_AREG0;
159
+ tcg_target_long compare_mask;
160
+
161
+ ldst = new_ldst_label(s);
162
+ ldst->is_ld = is_ld;
163
+ ldst->oi = oi;
164
+ ldst->addrlo_reg = addr_reg;
165
+
166
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
167
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
168
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, mask_base, mask_ofs);
169
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, table_base, table_ofs);
170
+
171
+ tcg_out_opc_imm(s, OPC_SRLI, TCG_REG_TMP2, addr_reg,
172
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
173
+ tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
174
+ tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
175
+
176
+ /* Load the tlb comparator and the addend. */
177
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
178
+ is_ld ? offsetof(CPUTLBEntry, addr_read)
179
+ : offsetof(CPUTLBEntry, addr_write));
180
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
181
+ offsetof(CPUTLBEntry, addend));
182
+
183
+ /* We don't support unaligned accesses. */
184
+ if (a_bits < s_bits) {
185
+ a_bits = s_bits;
186
+ }
187
+ /* Clear the non-page, non-alignment bits from the address. */
188
+ compare_mask = (tcg_target_long)TARGET_PAGE_MASK | a_mask;
189
+ if (compare_mask == sextreg(compare_mask, 0, 12)) {
190
+ tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, compare_mask);
191
+ } else {
192
+ tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
193
+ tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr_reg);
194
+ }
195
+
196
+ /* Compare masked address with the TLB entry. */
197
+ ldst->label_ptr[0] = s->code_ptr;
198
+ tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP0, TCG_REG_TMP1, 0);
199
+
200
+ /* TLB Hit - translate address using addend. */
201
+ if (TARGET_LONG_BITS == 32) {
202
+ tcg_out_ext32u(s, TCG_REG_TMP0, addr_reg);
203
+ addr_reg = TCG_REG_TMP0;
204
+ }
205
+ tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr_reg);
206
+ *pbase = TCG_REG_TMP0;
207
+#else
208
+ if (a_mask) {
209
+ ldst = new_ldst_label(s);
210
+ ldst->is_ld = is_ld;
211
+ ldst->oi = oi;
212
+ ldst->addrlo_reg = addr_reg;
213
+
214
+ /* We are expecting a_bits max 7, so we can always use andi. */
215
+ tcg_debug_assert(a_bits < 12);
216
+ tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, a_mask);
217
+
218
+ ldst->label_ptr[0] = s->code_ptr;
219
+ tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP1, TCG_REG_ZERO, 0);
220
+ }
221
+
222
+ TCGReg base = addr_reg;
223
+ if (TARGET_LONG_BITS == 32) {
224
+ tcg_out_ext32u(s, TCG_REG_TMP0, base);
225
+ base = TCG_REG_TMP0;
226
+ }
227
+ if (guest_base != 0) {
228
+ tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_GUEST_BASE_REG, base);
229
+ base = TCG_REG_TMP0;
230
+ }
231
+ *pbase = base;
232
+#endif
233
+
234
+ return ldst;
235
+}
236
+
237
static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg val,
238
TCGReg base, MemOp opc, TCGType type)
239
{
240
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg val,
241
static void tcg_out_qemu_ld(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
242
MemOpIdx oi, TCGType data_type)
243
{
244
- MemOp opc = get_memop(oi);
245
+ TCGLabelQemuLdst *ldst;
246
TCGReg base;
247
248
-#if defined(CONFIG_SOFTMMU)
249
- tcg_insn_unit *label_ptr[1];
250
+ ldst = prepare_host_addr(s, &base, addr_reg, oi, true);
251
+ tcg_out_qemu_ld_direct(s, data_reg, base, get_memop(oi), data_type);
252
253
- base = tcg_out_tlb_load(s, addr_reg, oi, label_ptr, 1);
254
- tcg_out_qemu_ld_direct(s, data_reg, base, opc, data_type);
255
- add_qemu_ldst_label(s, true, oi, data_type, data_reg, addr_reg,
256
- s->code_ptr, label_ptr);
257
-#else
258
- unsigned a_bits = get_alignment_bits(opc);
259
- if (a_bits) {
260
- tcg_out_test_alignment(s, true, addr_reg, a_bits);
261
+ if (ldst) {
262
+ ldst->type = data_type;
263
+ ldst->datalo_reg = data_reg;
264
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
265
}
266
- base = addr_reg;
267
- if (TARGET_LONG_BITS == 32) {
268
- tcg_out_ext32u(s, TCG_REG_TMP0, base);
269
- base = TCG_REG_TMP0;
270
- }
271
- if (guest_base != 0) {
272
- tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_GUEST_BASE_REG, base);
273
- base = TCG_REG_TMP0;
274
- }
275
- tcg_out_qemu_ld_direct(s, data_reg, base, opc, data_type);
276
-#endif
277
}
278
279
static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg val,
280
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg val,
281
static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
282
MemOpIdx oi, TCGType data_type)
283
{
284
- MemOp opc = get_memop(oi);
285
+ TCGLabelQemuLdst *ldst;
286
TCGReg base;
287
288
-#if defined(CONFIG_SOFTMMU)
289
- tcg_insn_unit *label_ptr[1];
290
+ ldst = prepare_host_addr(s, &base, addr_reg, oi, false);
291
+ tcg_out_qemu_st_direct(s, data_reg, base, get_memop(oi));
292
293
- base = tcg_out_tlb_load(s, addr_reg, oi, label_ptr, 0);
294
- tcg_out_qemu_st_direct(s, data_reg, base, opc);
295
- add_qemu_ldst_label(s, false, oi, data_type, data_reg, addr_reg,
296
- s->code_ptr, label_ptr);
297
-#else
298
- unsigned a_bits = get_alignment_bits(opc);
299
- if (a_bits) {
300
- tcg_out_test_alignment(s, false, addr_reg, a_bits);
301
+ if (ldst) {
302
+ ldst->type = data_type;
303
+ ldst->datalo_reg = data_reg;
304
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
305
}
306
- base = addr_reg;
307
- if (TARGET_LONG_BITS == 32) {
308
- tcg_out_ext32u(s, TCG_REG_TMP0, base);
309
- base = TCG_REG_TMP0;
310
- }
311
- if (guest_base != 0) {
312
- tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_GUEST_BASE_REG, base);
313
- base = TCG_REG_TMP0;
314
- }
315
- tcg_out_qemu_st_direct(s, data_reg, base, opc);
316
-#endif
317
}
318
319
static const tcg_insn_unit *tb_ret_addr;
320
--
321
2.34.1
322
323
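In the user-only half of the riscv prepare_host_addr above, the guest address is zero-extended when TARGET_LONG_BITS is 32 and then offset by guest_base to form the single base register handed back via *pbase. A rough standalone equivalent; the guest_base value and the 32-bit flag below are just example inputs.

#include <stdint.h>
#include <stdio.h>

static uint64_t user_only_base(uint64_t guest_addr, uint64_t guest_base,
                               int guest_is_32bit)
{
    /* ext32u of the address when the guest is 32-bit, then ADD guest_base. */
    uint64_t a = guest_is_32bit ? (uint32_t)guest_addr : guest_addr;
    return a + guest_base;
}

int main(void)
{
    unsigned long long base =
        user_only_base(0xffffffff80001000ull, 0x10000, 1);
    printf("%#llx\n", base);   /* 0x80011000: high bits dropped, offset added */
    return 0;
}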
Deleted patch
1
Merge tcg_out_tlb_load, add_qemu_ldst_label, tcg_out_test_alignment,
2
tcg_prepare_user_ldst, and some code that lived in both tcg_out_qemu_ld
3
and tcg_out_qemu_st into one function that returns HostAddress and
4
TCGLabelQemuLdst structures.
5
1
6
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
9
tcg/s390x/tcg-target.c.inc | 263 ++++++++++++++++---------------------
10
1 file changed, 113 insertions(+), 150 deletions(-)
11
12
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
13
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/s390x/tcg-target.c.inc
15
+++ b/tcg/s390x/tcg-target.c.inc
16
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp opc, TCGReg data,
17
}
18
19
#if defined(CONFIG_SOFTMMU)
20
-/* We're expecting to use a 20-bit negative offset on the tlb memory ops. */
21
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
22
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 19));
23
-
24
-/* Load and compare a TLB entry, leaving the flags set. Loads the TLB
25
- addend into R2. Returns a register with the santitized guest address. */
26
-static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg, MemOp opc,
27
- int mem_index, bool is_ld)
28
-{
29
- unsigned s_bits = opc & MO_SIZE;
30
- unsigned a_bits = get_alignment_bits(opc);
31
- unsigned s_mask = (1 << s_bits) - 1;
32
- unsigned a_mask = (1 << a_bits) - 1;
33
- int fast_off = TLB_MASK_TABLE_OFS(mem_index);
34
- int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
35
- int table_off = fast_off + offsetof(CPUTLBDescFast, table);
36
- int ofs, a_off;
37
- uint64_t tlb_mask;
38
-
39
- tcg_out_sh64(s, RSY_SRLG, TCG_REG_R2, addr_reg, TCG_REG_NONE,
40
- TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
41
- tcg_out_insn(s, RXY, NG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, mask_off);
42
- tcg_out_insn(s, RXY, AG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, table_off);
43
-
44
- /* For aligned accesses, we check the first byte and include the alignment
45
- bits within the address. For unaligned access, we check that we don't
46
- cross pages using the address of the last byte of the access. */
47
- a_off = (a_bits >= s_bits ? 0 : s_mask - a_mask);
48
- tlb_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
49
- if (a_off == 0) {
50
- tgen_andi_risbg(s, TCG_REG_R3, addr_reg, tlb_mask);
51
- } else {
52
- tcg_out_insn(s, RX, LA, TCG_REG_R3, addr_reg, TCG_REG_NONE, a_off);
53
- tgen_andi(s, TCG_TYPE_TL, TCG_REG_R3, tlb_mask);
54
- }
55
-
56
- if (is_ld) {
57
- ofs = offsetof(CPUTLBEntry, addr_read);
58
- } else {
59
- ofs = offsetof(CPUTLBEntry, addr_write);
60
- }
61
- if (TARGET_LONG_BITS == 32) {
62
- tcg_out_insn(s, RX, C, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
63
- } else {
64
- tcg_out_insn(s, RXY, CG, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
65
- }
66
-
67
- tcg_out_insn(s, RXY, LG, TCG_REG_R2, TCG_REG_R2, TCG_REG_NONE,
68
- offsetof(CPUTLBEntry, addend));
69
-
70
- if (TARGET_LONG_BITS == 32) {
71
- tcg_out_ext32u(s, TCG_REG_R3, addr_reg);
72
- return TCG_REG_R3;
73
- }
74
- return addr_reg;
75
-}
76
-
77
-static void add_qemu_ldst_label(TCGContext *s, bool is_ld, MemOpIdx oi,
78
- TCGType type, TCGReg data, TCGReg addr,
79
- tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
80
-{
81
- TCGLabelQemuLdst *label = new_ldst_label(s);
82
-
83
- label->is_ld = is_ld;
84
- label->oi = oi;
85
- label->type = type;
86
- label->datalo_reg = data;
87
- label->addrlo_reg = addr;
88
- label->raddr = tcg_splitwx_to_rx(raddr);
89
- label->label_ptr[0] = label_ptr;
90
-}
91
-
92
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
93
{
94
TCGReg addr_reg = lb->addrlo_reg;
95
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
96
return true;
97
}
98
#else
99
-static void tcg_out_test_alignment(TCGContext *s, bool is_ld,
100
- TCGReg addrlo, unsigned a_bits)
101
-{
102
- unsigned a_mask = (1 << a_bits) - 1;
103
- TCGLabelQemuLdst *l = new_ldst_label(s);
104
-
105
- l->is_ld = is_ld;
106
- l->addrlo_reg = addrlo;
107
-
108
- /* We are expecting a_bits to max out at 7, much lower than TMLL. */
109
- tcg_debug_assert(a_bits < 16);
110
- tcg_out_insn(s, RI, TMLL, addrlo, a_mask);
111
-
112
- tcg_out16(s, RI_BRC | (7 << 4)); /* CC in {1,2,3} */
113
- l->label_ptr[0] = s->code_ptr;
114
- s->code_ptr += 1;
115
-
116
- l->raddr = tcg_splitwx_to_rx(s->code_ptr);
117
-}
118
-
119
static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
120
{
121
if (!patch_reloc(l->label_ptr[0], R_390_PC16DBL,
122
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
123
{
124
return tcg_out_fail_alignment(s, l);
125
}
126
+#endif /* CONFIG_SOFTMMU */
127
128
-static HostAddress tcg_prepare_user_ldst(TCGContext *s, TCGReg addr_reg)
129
+/*
130
+ * For softmmu, perform the TLB load and compare.
131
+ * For useronly, perform any required alignment tests.
132
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
133
+ * is required and fill in @h with the host address for the fast path.
134
+ */
135
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
136
+ TCGReg addr_reg, MemOpIdx oi,
137
+ bool is_ld)
138
{
139
- TCGReg index;
140
- int disp;
141
+ TCGLabelQemuLdst *ldst = NULL;
142
+ MemOp opc = get_memop(oi);
143
+ unsigned a_bits = get_alignment_bits(opc);
144
+ unsigned a_mask = (1u << a_bits) - 1;
145
146
+#ifdef CONFIG_SOFTMMU
147
+ unsigned s_bits = opc & MO_SIZE;
148
+ unsigned s_mask = (1 << s_bits) - 1;
149
+ int mem_index = get_mmuidx(oi);
150
+ int fast_off = TLB_MASK_TABLE_OFS(mem_index);
151
+ int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
152
+ int table_off = fast_off + offsetof(CPUTLBDescFast, table);
153
+ int ofs, a_off;
154
+ uint64_t tlb_mask;
155
+
156
+ ldst = new_ldst_label(s);
157
+ ldst->is_ld = is_ld;
158
+ ldst->oi = oi;
159
+ ldst->addrlo_reg = addr_reg;
160
+
161
+ tcg_out_sh64(s, RSY_SRLG, TCG_REG_R2, addr_reg, TCG_REG_NONE,
162
+ TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
163
+
164
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
165
+ QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 19));
166
+ tcg_out_insn(s, RXY, NG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, mask_off);
167
+ tcg_out_insn(s, RXY, AG, TCG_REG_R2, TCG_AREG0, TCG_REG_NONE, table_off);
168
+
169
+ /*
170
+ * For aligned accesses, we check the first byte and include the alignment
171
+ * bits within the address. For unaligned access, we check that we don't
172
+ * cross pages using the address of the last byte of the access.
173
+ */
174
+ a_off = (a_bits >= s_bits ? 0 : s_mask - a_mask);
175
+ tlb_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
176
+ if (a_off == 0) {
177
+ tgen_andi_risbg(s, TCG_REG_R3, addr_reg, tlb_mask);
178
+ } else {
179
+ tcg_out_insn(s, RX, LA, TCG_REG_R3, addr_reg, TCG_REG_NONE, a_off);
180
+ tgen_andi(s, TCG_TYPE_TL, TCG_REG_R3, tlb_mask);
181
+ }
182
+
183
+ if (is_ld) {
184
+ ofs = offsetof(CPUTLBEntry, addr_read);
185
+ } else {
186
+ ofs = offsetof(CPUTLBEntry, addr_write);
187
+ }
188
+ if (TARGET_LONG_BITS == 32) {
189
+ tcg_out_insn(s, RX, C, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
190
+ } else {
191
+ tcg_out_insn(s, RXY, CG, TCG_REG_R3, TCG_REG_R2, TCG_REG_NONE, ofs);
192
+ }
193
+
194
+ tcg_out16(s, RI_BRC | (S390_CC_NE << 4));
195
+ ldst->label_ptr[0] = s->code_ptr++;
196
+
197
+ h->index = TCG_REG_R2;
198
+ tcg_out_insn(s, RXY, LG, h->index, TCG_REG_R2, TCG_REG_NONE,
199
+ offsetof(CPUTLBEntry, addend));
200
+
201
+ h->base = addr_reg;
202
+ if (TARGET_LONG_BITS == 32) {
203
+ tcg_out_ext32u(s, TCG_REG_R3, addr_reg);
204
+ h->base = TCG_REG_R3;
205
+ }
206
+ h->disp = 0;
207
+#else
208
+ if (a_mask) {
209
+ ldst = new_ldst_label(s);
210
+ ldst->is_ld = is_ld;
211
+ ldst->oi = oi;
212
+ ldst->addrlo_reg = addr_reg;
213
+
214
+ /* We are expecting a_bits to max out at 7, much lower than TMLL. */
215
+ tcg_debug_assert(a_bits < 16);
216
+ tcg_out_insn(s, RI, TMLL, addr_reg, a_mask);
217
+
218
+ tcg_out16(s, RI_BRC | (7 << 4)); /* CC in {1,2,3} */
219
+ ldst->label_ptr[0] = s->code_ptr++;
220
+ }
221
+
222
+ h->base = addr_reg;
223
if (TARGET_LONG_BITS == 32) {
224
tcg_out_ext32u(s, TCG_TMP0, addr_reg);
225
- addr_reg = TCG_TMP0;
226
+ h->base = TCG_TMP0;
227
}
228
if (guest_base < 0x80000) {
229
- index = TCG_REG_NONE;
230
- disp = guest_base;
231
+ h->index = TCG_REG_NONE;
232
+ h->disp = guest_base;
233
} else {
234
- index = TCG_GUEST_BASE_REG;
235
- disp = 0;
236
+ h->index = TCG_GUEST_BASE_REG;
237
+ h->disp = 0;
238
}
239
- return (HostAddress){ .base = addr_reg, .index = index, .disp = disp };
240
+#endif
241
+
242
+ return ldst;
243
}
244
-#endif /* CONFIG_SOFTMMU */
245
246
static void tcg_out_qemu_ld(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
247
MemOpIdx oi, TCGType data_type)
248
{
249
- MemOp opc = get_memop(oi);
250
+ TCGLabelQemuLdst *ldst;
251
HostAddress h;
252
253
-#ifdef CONFIG_SOFTMMU
254
- unsigned mem_index = get_mmuidx(oi);
255
- tcg_insn_unit *label_ptr;
256
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, true);
257
+ tcg_out_qemu_ld_direct(s, get_memop(oi), data_reg, h);
258
259
- h.base = tcg_out_tlb_read(s, addr_reg, opc, mem_index, 1);
260
- h.index = TCG_REG_R2;
261
- h.disp = 0;
262
-
263
- tcg_out16(s, RI_BRC | (S390_CC_NE << 4));
264
- label_ptr = s->code_ptr;
265
- s->code_ptr += 1;
266
-
267
- tcg_out_qemu_ld_direct(s, opc, data_reg, h);
268
-
269
- add_qemu_ldst_label(s, true, oi, data_type, data_reg, addr_reg,
270
- s->code_ptr, label_ptr);
271
-#else
272
- unsigned a_bits = get_alignment_bits(opc);
273
-
274
- if (a_bits) {
275
- tcg_out_test_alignment(s, true, addr_reg, a_bits);
276
+ if (ldst) {
277
+ ldst->type = data_type;
278
+ ldst->datalo_reg = data_reg;
279
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
280
}
281
- h = tcg_prepare_user_ldst(s, addr_reg);
282
- tcg_out_qemu_ld_direct(s, opc, data_reg, h);
283
-#endif
284
}
285
286
static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
287
MemOpIdx oi, TCGType data_type)
288
{
289
- MemOp opc = get_memop(oi);
290
+ TCGLabelQemuLdst *ldst;
291
HostAddress h;
292
293
-#ifdef CONFIG_SOFTMMU
294
- unsigned mem_index = get_mmuidx(oi);
295
- tcg_insn_unit *label_ptr;
296
+ ldst = prepare_host_addr(s, &h, addr_reg, oi, false);
297
+ tcg_out_qemu_st_direct(s, get_memop(oi), data_reg, h);
298
299
- h.base = tcg_out_tlb_read(s, addr_reg, opc, mem_index, 0);
300
- h.index = TCG_REG_R2;
301
- h.disp = 0;
302
-
303
- tcg_out16(s, RI_BRC | (S390_CC_NE << 4));
304
- label_ptr = s->code_ptr;
305
- s->code_ptr += 1;
306
-
307
- tcg_out_qemu_st_direct(s, opc, data_reg, h);
308
-
309
- add_qemu_ldst_label(s, false, oi, data_type, data_reg, addr_reg,
310
- s->code_ptr, label_ptr);
311
-#else
312
- unsigned a_bits = get_alignment_bits(opc);
313
-
314
- if (a_bits) {
315
- tcg_out_test_alignment(s, false, addr_reg, a_bits);
316
+ if (ldst) {
317
+ ldst->type = data_type;
318
+ ldst->datalo_reg = data_reg;
319
+ ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
320
}
321
- h = tcg_prepare_user_ldst(s, addr_reg);
322
- tcg_out_qemu_st_direct(s, opc, data_reg, h);
323
-#endif
324
}
325
326
static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
327
--
328
2.34.1
329
330
Deleted patch
1
Use tcg_out_st_helper_args. This eliminates the use of a tail call to
2
the store helper. This may or may not be an improvement, depending on
3
the call/return branch prediction of the host microarchitecture.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/i386/tcg-target.c.inc | 57 +++------------------------------------
9
1 file changed, 4 insertions(+), 53 deletions(-)
10
11
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/i386/tcg-target.c.inc
14
+++ b/tcg/i386/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
16
*/
17
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
18
{
19
- MemOpIdx oi = l->oi;
20
- MemOp opc = get_memop(oi);
21
- MemOp s_bits = opc & MO_SIZE;
22
+ MemOp opc = get_memop(l->oi);
23
tcg_insn_unit **label_ptr = &l->label_ptr[0];
24
- TCGReg retaddr;
25
26
/* resolve label address */
27
tcg_patch32(label_ptr[0], s->code_ptr - label_ptr[0] - 4);
28
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
29
tcg_patch32(label_ptr[1], s->code_ptr - label_ptr[1] - 4);
30
}
31
32
- if (TCG_TARGET_REG_BITS == 32) {
33
- int ofs = 0;
34
+ tcg_out_st_helper_args(s, l, &ldst_helper_param);
35
+ tcg_out_branch(s, 1, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
36
37
- tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
38
- ofs += 4;
39
-
40
- tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
41
- ofs += 4;
42
-
43
- if (TARGET_LONG_BITS == 64) {
44
- tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
45
- ofs += 4;
46
- }
47
-
48
- tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
49
- ofs += 4;
50
-
51
- if (s_bits == MO_64) {
52
- tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
53
- ofs += 4;
54
- }
55
-
56
- tcg_out_sti(s, TCG_TYPE_I32, oi, TCG_REG_ESP, ofs);
57
- ofs += 4;
58
-
59
- retaddr = TCG_REG_EAX;
60
- tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
61
- tcg_out_st(s, TCG_TYPE_PTR, retaddr, TCG_REG_ESP, ofs);
62
- } else {
63
- tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
64
- tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
65
- l->addrlo_reg);
66
- tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
67
- tcg_target_call_iarg_regs[2], l->datalo_reg);
68
- tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3], oi);
69
-
70
- if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
71
- retaddr = tcg_target_call_iarg_regs[4];
72
- tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
73
- } else {
74
- retaddr = TCG_REG_RAX;
75
- tcg_out_movi(s, TCG_TYPE_PTR, retaddr, (uintptr_t)l->raddr);
76
- tcg_out_st(s, TCG_TYPE_PTR, retaddr, TCG_REG_ESP,
77
- TCG_TARGET_CALL_STACK_OFFSET);
78
- }
79
- }
80
-
81
- /* "Tail call" to the helper, with the return address back inline. */
82
- tcg_out_push(s, retaddr);
83
- tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
84
+ tcg_out_jmp(s, l->raddr);
85
return true;
86
}
87
#else
88
--
89
2.34.1
90
91
Deleted patch
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
2
and tcg_out_st_helper_args. This allows our local
3
tcg_out_arg_* infrastructure to be removed.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/arm/tcg-target.c.inc | 140 +++++----------------------------------
9
1 file changed, 18 insertions(+), 122 deletions(-)
10
11
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/arm/tcg-target.c.inc
14
+++ b/tcg/arm/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ tcg_out_ldrd_rwb(TCGContext *s, ARMCond cond, TCGReg rt, TCGReg rn, TCGReg rm)
16
tcg_out_memop_r(s, cond, INSN_LDRD_REG, rt, rn, rm, 1, 1, 1);
17
}
18
19
-static void tcg_out_strd_8(TCGContext *s, ARMCond cond, TCGReg rt,
20
- TCGReg rn, int imm8)
21
+static void __attribute__((unused))
22
+tcg_out_strd_8(TCGContext *s, ARMCond cond, TCGReg rt, TCGReg rn, int imm8)
23
{
24
tcg_out_memop_8(s, cond, INSN_STRD_IMM, rt, rn, imm8, 1, 0);
25
}
26
@@ -XXX,XX +XXX,XX @@ static void tcg_out_ext8u(TCGContext *s, TCGReg rd, TCGReg rn)
27
tcg_out_dat_imm(s, COND_AL, ARITH_AND, rd, rn, 0xff);
28
}
29
30
-static void __attribute__((unused))
31
-tcg_out_ext8u_cond(TCGContext *s, ARMCond cond, TCGReg rd, TCGReg rn)
32
-{
33
- tcg_out_dat_imm(s, cond, ARITH_AND, rd, rn, 0xff);
34
-}
35
-
36
static void tcg_out_ext16s(TCGContext *s, TCGType t, TCGReg rd, TCGReg rn)
37
{
38
/* sxth */
39
tcg_out32(s, 0x06bf0070 | (COND_AL << 28) | (rd << 12) | rn);
40
}
41
42
-static void tcg_out_ext16u_cond(TCGContext *s, ARMCond cond,
43
- TCGReg rd, TCGReg rn)
44
-{
45
- /* uxth */
46
- tcg_out32(s, 0x06ff0070 | (cond << 28) | (rd << 12) | rn);
47
-}
48
-
49
static void tcg_out_ext16u(TCGContext *s, TCGReg rd, TCGReg rn)
50
{
51
- tcg_out_ext16u_cond(s, COND_AL, rd, rn);
52
+ /* uxth */
53
+ tcg_out32(s, 0x06ff0070 | (COND_AL << 28) | (rd << 12) | rn);
54
}
55
56
static void tcg_out_ext32s(TCGContext *s, TCGReg rd, TCGReg rn)
57
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[MO_SIZE + 1] = {
58
#endif
59
};
60
61
-/* Helper routines for marshalling helper function arguments into
62
- * the correct registers and stack.
63
- * argreg is where we want to put this argument, arg is the argument itself.
64
- * Return value is the updated argreg ready for the next call.
65
- * Note that argreg 0..3 is real registers, 4+ on stack.
66
- *
67
- * We provide routines for arguments which are: immediate, 32 bit
68
- * value in register, 16 and 8 bit values in register (which must be zero
69
- * extended before use) and 64 bit value in a lo:hi register pair.
70
- */
71
-#define DEFINE_TCG_OUT_ARG(NAME, ARGTYPE, MOV_ARG, EXT_ARG) \
72
-static TCGReg NAME(TCGContext *s, TCGReg argreg, ARGTYPE arg) \
73
-{ \
74
- if (argreg < 4) { \
75
- MOV_ARG(s, COND_AL, argreg, arg); \
76
- } else { \
77
- int ofs = (argreg - 4) * 4; \
78
- EXT_ARG; \
79
- tcg_debug_assert(ofs + 4 <= TCG_STATIC_CALL_ARGS_SIZE); \
80
- tcg_out_st32_12(s, COND_AL, arg, TCG_REG_CALL_STACK, ofs); \
81
- } \
82
- return argreg + 1; \
83
-}
84
-
85
-DEFINE_TCG_OUT_ARG(tcg_out_arg_imm32, uint32_t, tcg_out_movi32,
86
- (tcg_out_movi32(s, COND_AL, TCG_REG_TMP, arg), arg = TCG_REG_TMP))
87
-DEFINE_TCG_OUT_ARG(tcg_out_arg_reg8, TCGReg, tcg_out_ext8u_cond,
88
- (tcg_out_ext8u_cond(s, COND_AL, TCG_REG_TMP, arg), arg = TCG_REG_TMP))
89
-DEFINE_TCG_OUT_ARG(tcg_out_arg_reg16, TCGReg, tcg_out_ext16u_cond,
90
- (tcg_out_ext16u_cond(s, COND_AL, TCG_REG_TMP, arg), arg = TCG_REG_TMP))
91
-DEFINE_TCG_OUT_ARG(tcg_out_arg_reg32, TCGReg, tcg_out_mov_reg, )
92
-
93
-static TCGReg tcg_out_arg_reg64(TCGContext *s, TCGReg argreg,
94
- TCGReg arglo, TCGReg arghi)
95
+static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
96
{
97
- /* 64 bit arguments must go in even/odd register pairs
98
- * and in 8-aligned stack slots.
99
- */
100
- if (argreg & 1) {
101
- argreg++;
102
- }
103
- if (argreg >= 4 && (arglo & 1) == 0 && arghi == arglo + 1) {
104
- tcg_out_strd_8(s, COND_AL, arglo,
105
- TCG_REG_CALL_STACK, (argreg - 4) * 4);
106
- return argreg + 2;
107
- } else {
108
- argreg = tcg_out_arg_reg32(s, argreg, arglo);
109
- argreg = tcg_out_arg_reg32(s, argreg, arghi);
110
- return argreg;
111
- }
112
+ /* We arrive at the slow path via "BLNE", so R14 contains l->raddr. */
113
+ return TCG_REG_R14;
114
}
115
116
+static const TCGLdstHelperParam ldst_helper_param = {
117
+ .ra_gen = ldst_ra_gen,
118
+ .ntmp = 1,
119
+ .tmp = { TCG_REG_TMP },
120
+};
121
+
122
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
123
{
124
- TCGReg argreg;
125
- MemOpIdx oi = lb->oi;
126
- MemOp opc = get_memop(oi);
127
+ MemOp opc = get_memop(lb->oi);
128
129
if (!reloc_pc24(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
130
return false;
131
}
132
133
- argreg = tcg_out_arg_reg32(s, TCG_REG_R0, TCG_AREG0);
134
- if (TARGET_LONG_BITS == 64) {
135
- argreg = tcg_out_arg_reg64(s, argreg, lb->addrlo_reg, lb->addrhi_reg);
136
- } else {
137
- argreg = tcg_out_arg_reg32(s, argreg, lb->addrlo_reg);
138
- }
139
- argreg = tcg_out_arg_imm32(s, argreg, oi);
140
- argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
141
-
142
- /* Use the canonical unsigned helpers and minimize icache usage. */
143
+ tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
144
tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SIZE]);
145
-
146
- if ((opc & MO_SIZE) == MO_64) {
147
- TCGMovExtend ext[2] = {
148
- { .dst = lb->datalo_reg, .dst_type = TCG_TYPE_I32,
149
- .src = TCG_REG_R0, .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
150
- { .dst = lb->datahi_reg, .dst_type = TCG_TYPE_I32,
151
- .src = TCG_REG_R1, .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
152
- };
153
- tcg_out_movext2(s, &ext[0], &ext[1], TCG_REG_TMP);
154
- } else {
155
- tcg_out_movext(s, TCG_TYPE_I32, lb->datalo_reg,
156
- TCG_TYPE_I32, opc & MO_SSIZE, TCG_REG_R0);
157
- }
158
+ tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
159
160
tcg_out_goto(s, COND_AL, lb->raddr);
161
return true;
162
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
163
164
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
165
{
166
- TCGReg argreg, datalo, datahi;
167
- MemOpIdx oi = lb->oi;
168
- MemOp opc = get_memop(oi);
169
+ MemOp opc = get_memop(lb->oi);
170
171
if (!reloc_pc24(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
172
return false;
173
}
174
175
- argreg = TCG_REG_R0;
176
- argreg = tcg_out_arg_reg32(s, argreg, TCG_AREG0);
177
- if (TARGET_LONG_BITS == 64) {
178
- argreg = tcg_out_arg_reg64(s, argreg, lb->addrlo_reg, lb->addrhi_reg);
179
- } else {
180
- argreg = tcg_out_arg_reg32(s, argreg, lb->addrlo_reg);
181
- }
182
-
183
- datalo = lb->datalo_reg;
184
- datahi = lb->datahi_reg;
185
- switch (opc & MO_SIZE) {
186
- case MO_8:
187
- argreg = tcg_out_arg_reg8(s, argreg, datalo);
188
- break;
189
- case MO_16:
190
- argreg = tcg_out_arg_reg16(s, argreg, datalo);
191
- break;
192
- case MO_32:
193
- default:
194
- argreg = tcg_out_arg_reg32(s, argreg, datalo);
195
- break;
196
- case MO_64:
197
- argreg = tcg_out_arg_reg64(s, argreg, datalo, datahi);
198
- break;
199
- }
200
-
201
- argreg = tcg_out_arg_imm32(s, argreg, oi);
202
- argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
203
+ tcg_out_st_helper_args(s, lb, &ldst_helper_param);
204
205
/* Tail-call to the helper, which will return to the fast path. */
206
tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & MO_SIZE]);
207
--
208
2.34.1
209
210
Deleted patch
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
2
and tcg_out_st_helper_args.
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
7
tcg/loongarch64/tcg-target.c.inc | 37 ++++++++++----------------------
8
1 file changed, 11 insertions(+), 26 deletions(-)
9
10
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
11
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/loongarch64/tcg-target.c.inc
13
+++ b/tcg/loongarch64/tcg-target.c.inc
14
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
15
return reloc_br_sd10k16(s->code_ptr - 1, target);
16
}
17
18
+static const TCGLdstHelperParam ldst_helper_param = {
19
+ .ntmp = 1, .tmp = { TCG_REG_TMP0 }
20
+};
21
+
22
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
23
{
24
- MemOpIdx oi = l->oi;
25
- MemOp opc = get_memop(oi);
26
- MemOp size = opc & MO_SIZE;
27
+ MemOp opc = get_memop(l->oi);
28
29
/* resolve label address */
30
if (!reloc_br_sk16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
31
return false;
32
}
33
34
- /* call load helper */
35
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
36
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A1, l->addrlo_reg);
37
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_A2, oi);
38
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_A3, (tcg_target_long)l->raddr);
39
-
40
- tcg_out_call_int(s, qemu_ld_helpers[size], false);
41
-
42
- tcg_out_movext(s, l->type, l->datalo_reg,
43
- TCG_TYPE_REG, opc & MO_SSIZE, TCG_REG_A0);
44
+ tcg_out_ld_helper_args(s, l, &ldst_helper_param);
45
+ tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SIZE], false);
46
+ tcg_out_ld_helper_ret(s, l, false, &ldst_helper_param);
47
return tcg_out_goto(s, l->raddr);
48
}
49
50
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
51
{
52
- MemOpIdx oi = l->oi;
53
- MemOp opc = get_memop(oi);
54
- MemOp size = opc & MO_SIZE;
55
+ MemOp opc = get_memop(l->oi);
56
57
/* resolve label address */
58
if (!reloc_br_sk16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
59
return false;
60
}
61
62
- /* call store helper */
63
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
64
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A1, l->addrlo_reg);
65
- tcg_out_movext(s, size == MO_64 ? TCG_TYPE_I32 : TCG_TYPE_I32, TCG_REG_A2,
66
- l->type, size, l->datalo_reg);
67
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_A3, oi);
68
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_A4, (tcg_target_long)l->raddr);
69
-
70
- tcg_out_call_int(s, qemu_st_helpers[size], false);
71
-
72
+ tcg_out_st_helper_args(s, l, &ldst_helper_param);
73
+ tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE], false);
74
return tcg_out_goto(s, l->raddr);
75
}
76
#else
77
--
78
2.34.1
79
80
Deleted patch
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
2
and tcg_out_st_helper_args.
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/ppc/tcg-target.c.inc | 88 ++++++++++++----------------------------
9
1 file changed, 26 insertions(+), 62 deletions(-)
10
11
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.c.inc
14
+++ b/tcg/ppc/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
16
[MO_BEUQ] = helper_be_stq_mmu,
17
};
18
19
+static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
20
+{
21
+ if (arg < 0) {
22
+ arg = TCG_REG_TMP1;
23
+ }
24
+ tcg_out32(s, MFSPR | RT(arg) | LR);
25
+ return arg;
26
+}
27
+
28
+/*
29
+ * For the purposes of ppc32 sorting 4 input registers into 4 argument
30
+ * registers, there is an outside chance we would require 3 temps.
31
+ * Because of constraints, no inputs are in r3, and env will not be
32
+ * placed into r3 until after the sorting is done, and is thus free.
33
+ */
34
+static const TCGLdstHelperParam ldst_helper_param = {
35
+ .ra_gen = ldst_ra_gen,
36
+ .ntmp = 3,
37
+ .tmp = { TCG_REG_TMP1, TCG_REG_R0, TCG_REG_R3 }
38
+};
39
+
40
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
41
{
42
- MemOpIdx oi = lb->oi;
43
- MemOp opc = get_memop(oi);
44
- TCGReg hi, lo, arg = TCG_REG_R3;
45
+ MemOp opc = get_memop(lb->oi);
46
47
if (!reloc_pc14(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
48
return false;
49
}
50
51
- tcg_out_mov(s, TCG_TYPE_PTR, arg++, TCG_AREG0);
52
-
53
- lo = lb->addrlo_reg;
54
- hi = lb->addrhi_reg;
55
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
56
- arg |= (TCG_TARGET_CALL_ARG_I64 == TCG_CALL_ARG_EVEN);
57
- tcg_out_mov(s, TCG_TYPE_I32, arg++, hi);
58
- tcg_out_mov(s, TCG_TYPE_I32, arg++, lo);
59
- } else {
60
- /* If the address needed to be zero-extended, we'll have already
61
- placed it in R4. The only remaining case is 64-bit guest. */
62
- tcg_out_mov(s, TCG_TYPE_TL, arg++, lo);
63
- }
64
-
65
- tcg_out_movi(s, TCG_TYPE_I32, arg++, oi);
66
- tcg_out32(s, MFSPR | RT(arg) | LR);
67
-
68
+ tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
69
tcg_out_call_int(s, LK, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
70
-
71
- lo = lb->datalo_reg;
72
- hi = lb->datahi_reg;
73
- if (TCG_TARGET_REG_BITS == 32 && (opc & MO_SIZE) == MO_64) {
74
- tcg_out_mov(s, TCG_TYPE_I32, lo, TCG_REG_R4);
75
- tcg_out_mov(s, TCG_TYPE_I32, hi, TCG_REG_R3);
76
- } else {
77
- tcg_out_movext(s, lb->type, lo,
78
- TCG_TYPE_REG, opc & MO_SSIZE, TCG_REG_R3);
79
- }
80
+ tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
81
82
tcg_out_b(s, 0, lb->raddr);
83
return true;
84
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
85
86
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
87
{
88
- MemOpIdx oi = lb->oi;
89
- MemOp opc = get_memop(oi);
90
- MemOp s_bits = opc & MO_SIZE;
91
- TCGReg hi, lo, arg = TCG_REG_R3;
92
+ MemOp opc = get_memop(lb->oi);
93
94
if (!reloc_pc14(lb->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
95
return false;
96
}
97
98
- tcg_out_mov(s, TCG_TYPE_PTR, arg++, TCG_AREG0);
99
-
100
- lo = lb->addrlo_reg;
101
- hi = lb->addrhi_reg;
102
- if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
103
- arg |= (TCG_TARGET_CALL_ARG_I64 == TCG_CALL_ARG_EVEN);
104
- tcg_out_mov(s, TCG_TYPE_I32, arg++, hi);
105
- tcg_out_mov(s, TCG_TYPE_I32, arg++, lo);
106
- } else {
107
- /* If the address needed to be zero-extended, we'll have already
108
- placed it in R4. The only remaining case is 64-bit guest. */
109
- tcg_out_mov(s, TCG_TYPE_TL, arg++, lo);
110
- }
111
-
112
- lo = lb->datalo_reg;
113
- hi = lb->datahi_reg;
114
- if (TCG_TARGET_REG_BITS == 32 && s_bits == MO_64) {
115
- arg |= (TCG_TARGET_CALL_ARG_I64 == TCG_CALL_ARG_EVEN);
116
- tcg_out_mov(s, TCG_TYPE_I32, arg++, hi);
117
- tcg_out_mov(s, TCG_TYPE_I32, arg++, lo);
118
- } else {
119
- tcg_out_movext(s, s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32,
120
- arg++, lb->type, s_bits, lo);
121
- }
122
-
123
- tcg_out_movi(s, TCG_TYPE_I32, arg++, oi);
124
- tcg_out32(s, MFSPR | RT(arg) | LR);
125
-
126
+ tcg_out_st_helper_args(s, lb, &ldst_helper_param);
127
tcg_out_call_int(s, LK, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
128
129
tcg_out_b(s, 0, lb->raddr);
130
--
131
2.34.1
132
133
Deleted patch
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
2
and tcg_out_st_helper_args.
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/riscv/tcg-target.c.inc | 37 ++++++++++---------------------------
9
1 file changed, 10 insertions(+), 27 deletions(-)
10
11
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/riscv/tcg-target.c.inc
14
+++ b/tcg/riscv/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@ static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
16
tcg_debug_assert(ok);
17
}
18
19
+/* We have three temps, we might as well expose them. */
20
+static const TCGLdstHelperParam ldst_helper_param = {
21
+ .ntmp = 3, .tmp = { TCG_REG_TMP0, TCG_REG_TMP1, TCG_REG_TMP2 }
22
+};
23
+
24
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
25
{
26
- MemOpIdx oi = l->oi;
27
- MemOp opc = get_memop(oi);
28
- TCGReg a0 = tcg_target_call_iarg_regs[0];
29
- TCGReg a1 = tcg_target_call_iarg_regs[1];
30
- TCGReg a2 = tcg_target_call_iarg_regs[2];
31
- TCGReg a3 = tcg_target_call_iarg_regs[3];
32
+ MemOp opc = get_memop(l->oi);
33
34
/* resolve label address */
35
if (!reloc_sbimm12(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
36
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
37
}
38
39
/* call load helper */
40
- tcg_out_mov(s, TCG_TYPE_PTR, a0, TCG_AREG0);
41
- tcg_out_mov(s, TCG_TYPE_PTR, a1, l->addrlo_reg);
42
- tcg_out_movi(s, TCG_TYPE_PTR, a2, oi);
43
- tcg_out_movi(s, TCG_TYPE_PTR, a3, (tcg_target_long)l->raddr);
44
-
45
+ tcg_out_ld_helper_args(s, l, &ldst_helper_param);
46
tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SSIZE], false);
47
- tcg_out_mov(s, (opc & MO_SIZE) == MO_64, l->datalo_reg, a0);
48
+ tcg_out_ld_helper_ret(s, l, true, &ldst_helper_param);
49
50
tcg_out_goto(s, l->raddr);
51
return true;
52
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
53
54
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
55
{
56
- MemOpIdx oi = l->oi;
57
- MemOp opc = get_memop(oi);
58
- MemOp s_bits = opc & MO_SIZE;
59
- TCGReg a0 = tcg_target_call_iarg_regs[0];
60
- TCGReg a1 = tcg_target_call_iarg_regs[1];
61
- TCGReg a2 = tcg_target_call_iarg_regs[2];
62
- TCGReg a3 = tcg_target_call_iarg_regs[3];
63
- TCGReg a4 = tcg_target_call_iarg_regs[4];
64
+ MemOp opc = get_memop(l->oi);
65
66
/* resolve label address */
67
if (!reloc_sbimm12(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
68
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
69
}
70
71
/* call store helper */
72
- tcg_out_mov(s, TCG_TYPE_PTR, a0, TCG_AREG0);
73
- tcg_out_mov(s, TCG_TYPE_PTR, a1, l->addrlo_reg);
74
- tcg_out_movext(s, s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32, a2,
75
- l->type, s_bits, l->datalo_reg);
76
- tcg_out_movi(s, TCG_TYPE_PTR, a3, oi);
77
- tcg_out_movi(s, TCG_TYPE_PTR, a4, (tcg_target_long)l->raddr);
78
-
79
+ tcg_out_st_helper_args(s, l, &ldst_helper_param);
80
tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE], false);
81
82
tcg_out_goto(s, l->raddr);
83
--
84
2.34.1
85
86
Deleted patch
1
Use tcg_out_ld_helper_args, tcg_out_ld_helper_ret,
2
and tcg_out_st_helper_args.
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
7
tcg/s390x/tcg-target.c.inc | 35 ++++++++++-------------------------
8
1 file changed, 10 insertions(+), 25 deletions(-)
9
10
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
11
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/s390x/tcg-target.c.inc
13
+++ b/tcg/s390x/tcg-target.c.inc
14
@@ -XXX,XX +XXX,XX @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp opc, TCGReg data,
15
}
16
17
#if defined(CONFIG_SOFTMMU)
18
+static const TCGLdstHelperParam ldst_helper_param = {
19
+ .ntmp = 1, .tmp = { TCG_TMP0 }
20
+};
21
+
22
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
23
{
24
- TCGReg addr_reg = lb->addrlo_reg;
25
- TCGReg data_reg = lb->datalo_reg;
26
- MemOpIdx oi = lb->oi;
27
- MemOp opc = get_memop(oi);
28
+ MemOp opc = get_memop(lb->oi);
29
30
if (!patch_reloc(lb->label_ptr[0], R_390_PC16DBL,
31
(intptr_t)tcg_splitwx_to_rx(s->code_ptr), 2)) {
32
return false;
33
}
34
35
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R2, TCG_AREG0);
36
- if (TARGET_LONG_BITS == 64) {
37
- tcg_out_mov(s, TCG_TYPE_I64, TCG_REG_R3, addr_reg);
38
- }
39
- tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_R4, oi);
40
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R5, (uintptr_t)lb->raddr);
41
- tcg_out_call_int(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)]);
42
- tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_R2);
43
+ tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
44
+ tcg_out_call_int(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
45
+ tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
46
47
tgen_gotoi(s, S390_CC_ALWAYS, lb->raddr);
48
return true;
49
@@ -XXX,XX +XXX,XX @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
50
51
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
52
{
53
- TCGReg addr_reg = lb->addrlo_reg;
54
- TCGReg data_reg = lb->datalo_reg;
55
- MemOpIdx oi = lb->oi;
56
- MemOp opc = get_memop(oi);
57
- MemOp size = opc & MO_SIZE;
58
+ MemOp opc = get_memop(lb->oi);
59
60
if (!patch_reloc(lb->label_ptr[0], R_390_PC16DBL,
61
(intptr_t)tcg_splitwx_to_rx(s->code_ptr), 2)) {
62
return false;
63
}
64
65
- tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R2, TCG_AREG0);
66
- if (TARGET_LONG_BITS == 64) {
67
- tcg_out_mov(s, TCG_TYPE_I64, TCG_REG_R3, addr_reg);
68
- }
69
- tcg_out_movext(s, size == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32,
70
- TCG_REG_R4, lb->type, size, data_reg);
71
- tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_R5, oi);
72
- tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R6, (uintptr_t)lb->raddr);
73
+ tcg_out_st_helper_args(s, lb, &ldst_helper_param);
74
tcg_out_call_int(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
75
76
tgen_gotoi(s, S390_CC_ALWAYS, lb->raddr);
77
--
78
2.34.1
79
80
Deleted patch
1
The softmmu tlb uses TCG_REG_TMP[0-2], not any of the normally available
2
registers. Now that we handle overlap between inputs and helper arguments,
3
we can allow any allocatable reg.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/loongarch64/tcg-target-con-set.h | 2 --
9
tcg/loongarch64/tcg-target-con-str.h | 1 -
10
tcg/loongarch64/tcg-target.c.inc | 23 ++++-------------------
11
3 files changed, 4 insertions(+), 22 deletions(-)
12
13
diff --git a/tcg/loongarch64/tcg-target-con-set.h b/tcg/loongarch64/tcg-target-con-set.h
14
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/loongarch64/tcg-target-con-set.h
16
+++ b/tcg/loongarch64/tcg-target-con-set.h
17
@@ -XXX,XX +XXX,XX @@
18
C_O0_I1(r)
19
C_O0_I2(rZ, r)
20
C_O0_I2(rZ, rZ)
21
-C_O0_I2(LZ, L)
22
C_O1_I1(r, r)
23
-C_O1_I1(r, L)
24
C_O1_I2(r, r, rC)
25
C_O1_I2(r, r, ri)
26
C_O1_I2(r, r, rI)
27
diff --git a/tcg/loongarch64/tcg-target-con-str.h b/tcg/loongarch64/tcg-target-con-str.h
28
index XXXXXXX..XXXXXXX 100644
29
--- a/tcg/loongarch64/tcg-target-con-str.h
30
+++ b/tcg/loongarch64/tcg-target-con-str.h
31
@@ -XXX,XX +XXX,XX @@
32
* REGS(letter, register_mask)
33
*/
34
REGS('r', ALL_GENERAL_REGS)
35
-REGS('L', ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
36
37
/*
38
* Define constraint letters for constants:
39
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
40
index XXXXXXX..XXXXXXX 100644
41
--- a/tcg/loongarch64/tcg-target.c.inc
42
+++ b/tcg/loongarch64/tcg-target.c.inc
43
@@ -XXX,XX +XXX,XX @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
44
#define TCG_CT_CONST_C12 0x1000
45
#define TCG_CT_CONST_WSZ 0x2000
46
47
-#define ALL_GENERAL_REGS MAKE_64BIT_MASK(0, 32)
48
-/*
49
- * For softmmu, we need to avoid conflicts with the first 5
50
- * argument registers to call the helper. Some of these are
51
- * also used for the tlb lookup.
52
- */
53
-#ifdef CONFIG_SOFTMMU
54
-#define SOFTMMU_RESERVE_REGS MAKE_64BIT_MASK(TCG_REG_A0, 5)
55
-#else
56
-#define SOFTMMU_RESERVE_REGS 0
57
-#endif
58
-
59
+#define ALL_GENERAL_REGS MAKE_64BIT_MASK(0, 32)
60
61
static inline tcg_target_long sextreg(tcg_target_long val, int pos, int len)
62
{
63
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
64
case INDEX_op_st32_i64:
65
case INDEX_op_st_i32:
66
case INDEX_op_st_i64:
67
+ case INDEX_op_qemu_st_i32:
68
+ case INDEX_op_qemu_st_i64:
69
return C_O0_I2(rZ, r);
70
71
case INDEX_op_brcond_i32:
72
case INDEX_op_brcond_i64:
73
return C_O0_I2(rZ, rZ);
74
75
- case INDEX_op_qemu_st_i32:
76
- case INDEX_op_qemu_st_i64:
77
- return C_O0_I2(LZ, L);
78
-
79
case INDEX_op_ext8s_i32:
80
case INDEX_op_ext8s_i64:
81
case INDEX_op_ext8u_i32:
82
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
83
case INDEX_op_ld32u_i64:
84
case INDEX_op_ld_i32:
85
case INDEX_op_ld_i64:
86
- return C_O1_I1(r, r);
87
-
88
case INDEX_op_qemu_ld_i32:
89
case INDEX_op_qemu_ld_i64:
90
- return C_O1_I1(r, L);
91
+ return C_O1_I1(r, r);
92
93
case INDEX_op_andc_i32:
94
case INDEX_op_andc_i64:
95
--
96
2.34.1
97
98
Deleted patch
1
Compare the address vs the tlb entry with sign-extended values.
2
This simplifies the page+alignment mask constant, and the
3
generation of the last byte address for the misaligned test.
4
1
5
Move the tlb addend load up, and the zero-extension down.
6
7
This frees up a register, which allows us to use TMP3 as the returned base
8
address register instead of A0, which we were using as a 5th temporary.
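
A rough standalone illustration of the mask-constant point above, assuming
4 KiB pages and a 4-byte alignment check; the values are chosen for the
example and are not taken from the patch.

    #include <inttypes.h>
    #include <stdio.h>

    /* Assumed parameters for illustration only: 4 KiB pages, 4-byte alignment. */
    #define PAGE_BITS 12
    #define PAGE_MASK ((int64_t)-(1 << PAGE_BITS))

    int main(void)
    {
        unsigned a_mask = 4 - 1;

        /* Sign-extended: a small negative value (-4093 here), which fits in
           a single 16-bit signed immediate. */
        int64_t signed_mask = PAGE_MASK | a_mask;

        /* Zero-extended 32-bit view: 0xfffff003, which a 64-bit host would
           typically need a multi-instruction sequence to materialize. */
        uint64_t zero_ext_mask = (uint32_t)PAGE_MASK | a_mask;

        printf("sign-extended: %" PRId64 " (%#" PRIx64 ")\n",
               signed_mask, (uint64_t)signed_mask);
        printf("zero-extended: %#" PRIx64 "\n", zero_ext_mask);
        return 0;
    }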
9
10
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
11
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
12
---
13
tcg/mips/tcg-target.c.inc | 38 ++++++++++++++++++--------------------
14
1 file changed, 18 insertions(+), 20 deletions(-)
15
16
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
17
index XXXXXXX..XXXXXXX 100644
18
--- a/tcg/mips/tcg-target.c.inc
19
+++ b/tcg/mips/tcg-target.c.inc
20
@@ -XXX,XX +XXX,XX @@ typedef enum {
21
ALIAS_PADDI = sizeof(void *) == 4 ? OPC_ADDIU : OPC_DADDIU,
22
ALIAS_TSRL = TARGET_LONG_BITS == 32 || TCG_TARGET_REG_BITS == 32
23
? OPC_SRL : OPC_DSRL,
24
+ ALIAS_TADDI = TARGET_LONG_BITS == 32 || TCG_TARGET_REG_BITS == 32
25
+ ? OPC_ADDIU : OPC_DADDIU,
26
} MIPSInsn;
27
28
/*
29
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
30
int add_off = offsetof(CPUTLBEntry, addend);
31
int cmp_off = is_ld ? offsetof(CPUTLBEntry, addr_read)
32
: offsetof(CPUTLBEntry, addr_write);
33
- target_ulong tlb_mask;
34
35
ldst = new_ldst_label(s);
36
ldst->is_ld = is_ld;
37
ldst->oi = oi;
38
ldst->addrlo_reg = addrlo;
39
ldst->addrhi_reg = addrhi;
40
- base = TCG_REG_A0;
41
42
/* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
43
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
44
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
45
if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
46
tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + LO_OFF);
47
} else {
48
- tcg_out_ldst(s, (TARGET_LONG_BITS == 64 ? OPC_LD
49
- : TCG_TARGET_REG_BITS == 64 ? OPC_LWU : OPC_LW),
50
- TCG_TMP0, TCG_TMP3, cmp_off);
51
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_TMP0, TCG_TMP3, cmp_off);
52
}
53
54
- /* Zero extend a 32-bit guest address for a 64-bit host. */
55
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
56
- tcg_out_ext32u(s, base, addrlo);
57
- addrlo = base;
58
+ if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
59
+ /* Load the tlb addend for the fast path. */
60
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP3, TCG_TMP3, add_off);
61
}
62
63
/*
64
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
65
* For unaligned accesses, compare against the end of the access to
66
* verify that it does not cross a page boundary.
67
*/
68
- tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
69
- tcg_out_movi(s, TCG_TYPE_I32, TCG_TMP1, tlb_mask);
70
- if (a_mask >= s_mask) {
71
- tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrlo);
72
- } else {
73
- tcg_out_opc_imm(s, ALIAS_PADDI, TCG_TMP2, addrlo, s_mask - a_mask);
74
+ tcg_out_movi(s, TCG_TYPE_TL, TCG_TMP1, TARGET_PAGE_MASK | a_mask);
75
+ if (a_mask < s_mask) {
76
+ tcg_out_opc_imm(s, ALIAS_TADDI, TCG_TMP2, addrlo, s_mask - a_mask);
77
tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, TCG_TMP2);
78
+ } else {
79
+ tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrlo);
80
}
81
82
- if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
83
- /* Load the tlb addend for the fast path. */
84
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
85
+ /* Zero extend a 32-bit guest address for a 64-bit host. */
86
+ if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
87
+ tcg_out_ext32u(s, TCG_TMP2, addrlo);
88
+ addrlo = TCG_TMP2;
89
}
90
91
ldst->label_ptr[0] = s->code_ptr;
92
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
93
tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + HI_OFF);
94
95
/* Load the tlb addend for the fast path. */
96
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
97
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP3, TCG_TMP3, add_off);
98
99
ldst->label_ptr[1] = s->code_ptr;
100
tcg_out_opc_br(s, OPC_BNE, addrhi, TCG_TMP0);
101
}
102
103
/* delay slot */
104
- tcg_out_opc_reg(s, ALIAS_PADD, base, TCG_TMP2, addrlo);
105
+ base = TCG_TMP3;
106
+ tcg_out_opc_reg(s, ALIAS_PADD, base, TCG_TMP3, addrlo);
107
#else
108
if (a_mask && (use_mips32r6_instructions || a_bits != s_bits)) {
109
ldst = new_ldst_label(s);
110
--
111
2.34.1
112
113
Deleted patch
1
The softmmu tlb uses TCG_REG_TMP[0-3], not any of the normally available
2
registers. Now that we handle overlap between inputs and helper arguments,
3
and have eliminated use of A0, we can allow any allocatable reg.
4
1
5
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/mips/tcg-target-con-set.h | 13 +++++--------
9
tcg/mips/tcg-target-con-str.h | 2 --
10
tcg/mips/tcg-target.c.inc | 30 ++++++++----------------------
11
3 files changed, 13 insertions(+), 32 deletions(-)
12
13
diff --git a/tcg/mips/tcg-target-con-set.h b/tcg/mips/tcg-target-con-set.h
14
index XXXXXXX..XXXXXXX 100644
15
--- a/tcg/mips/tcg-target-con-set.h
16
+++ b/tcg/mips/tcg-target-con-set.h
17
@@ -XXX,XX +XXX,XX @@
18
C_O0_I1(r)
19
C_O0_I2(rZ, r)
20
C_O0_I2(rZ, rZ)
21
-C_O0_I2(SZ, S)
22
-C_O0_I3(SZ, S, S)
23
-C_O0_I3(SZ, SZ, S)
24
+C_O0_I3(rZ, r, r)
25
+C_O0_I3(rZ, rZ, r)
26
C_O0_I4(rZ, rZ, rZ, rZ)
27
-C_O0_I4(SZ, SZ, S, S)
28
-C_O1_I1(r, L)
29
+C_O0_I4(rZ, rZ, r, r)
30
C_O1_I1(r, r)
31
C_O1_I2(r, 0, rZ)
32
-C_O1_I2(r, L, L)
33
+C_O1_I2(r, r, r)
34
C_O1_I2(r, r, ri)
35
C_O1_I2(r, r, rI)
36
C_O1_I2(r, r, rIK)
37
@@ -XXX,XX +XXX,XX @@ C_O1_I2(r, rZ, rN)
38
C_O1_I2(r, rZ, rZ)
39
C_O1_I4(r, rZ, rZ, rZ, 0)
40
C_O1_I4(r, rZ, rZ, rZ, rZ)
41
-C_O2_I1(r, r, L)
42
-C_O2_I2(r, r, L, L)
43
+C_O2_I1(r, r, r)
44
C_O2_I2(r, r, r, r)
45
C_O2_I4(r, r, rZ, rZ, rN, rN)
46
diff --git a/tcg/mips/tcg-target-con-str.h b/tcg/mips/tcg-target-con-str.h
47
index XXXXXXX..XXXXXXX 100644
48
--- a/tcg/mips/tcg-target-con-str.h
49
+++ b/tcg/mips/tcg-target-con-str.h
50
@@ -XXX,XX +XXX,XX @@
51
* REGS(letter, register_mask)
52
*/
53
REGS('r', ALL_GENERAL_REGS)
54
-REGS('L', ALL_QLOAD_REGS)
55
-REGS('S', ALL_QSTORE_REGS)
56
57
/*
58
* Define constraint letters for constants:
59
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
60
index XXXXXXX..XXXXXXX 100644
61
--- a/tcg/mips/tcg-target.c.inc
62
+++ b/tcg/mips/tcg-target.c.inc
63
@@ -XXX,XX +XXX,XX @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
64
#define TCG_CT_CONST_WSZ 0x2000 /* word size */
65
66
#define ALL_GENERAL_REGS 0xffffffffu
67
-#define NOA0_REGS (ALL_GENERAL_REGS & ~(1 << TCG_REG_A0))
68
-
69
-#ifdef CONFIG_SOFTMMU
70
-#define ALL_QLOAD_REGS \
71
- (NOA0_REGS & ~((TCG_TARGET_REG_BITS < TARGET_LONG_BITS) << TCG_REG_A2))
72
-#define ALL_QSTORE_REGS \
73
- (NOA0_REGS & ~(TCG_TARGET_REG_BITS < TARGET_LONG_BITS \
74
- ? (1 << TCG_REG_A2) | (1 << TCG_REG_A3) \
75
- : (1 << TCG_REG_A1)))
76
-#else
77
-#define ALL_QLOAD_REGS NOA0_REGS
78
-#define ALL_QSTORE_REGS NOA0_REGS
79
-#endif
80
-
81
82
static bool is_p2m1(tcg_target_long val)
83
{
84
@@ -XXX,XX +XXX,XX @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
85
86
case INDEX_op_qemu_ld_i32:
87
return (TCG_TARGET_REG_BITS == 64 || TARGET_LONG_BITS == 32
88
- ? C_O1_I1(r, L) : C_O1_I2(r, L, L));
89
+ ? C_O1_I1(r, r) : C_O1_I2(r, r, r));
90
case INDEX_op_qemu_st_i32:
91
return (TCG_TARGET_REG_BITS == 64 || TARGET_LONG_BITS == 32
92
- ? C_O0_I2(SZ, S) : C_O0_I3(SZ, S, S));
93
+ ? C_O0_I2(rZ, r) : C_O0_I3(rZ, r, r));
94
case INDEX_op_qemu_ld_i64:
95
- return (TCG_TARGET_REG_BITS == 64 ? C_O1_I1(r, L)
96
- : TARGET_LONG_BITS == 32 ? C_O2_I1(r, r, L)
97
- : C_O2_I2(r, r, L, L));
98
+ return (TCG_TARGET_REG_BITS == 64 ? C_O1_I1(r, r)
99
+ : TARGET_LONG_BITS == 32 ? C_O2_I1(r, r, r)
100
+ : C_O2_I2(r, r, r, r));
101
case INDEX_op_qemu_st_i64:
102
- return (TCG_TARGET_REG_BITS == 64 ? C_O0_I2(SZ, S)
103
- : TARGET_LONG_BITS == 32 ? C_O0_I3(SZ, SZ, S)
104
- : C_O0_I4(SZ, SZ, S, S));
105
+ return (TCG_TARGET_REG_BITS == 64 ? C_O0_I2(rZ, r)
106
+ : TARGET_LONG_BITS == 32 ? C_O0_I3(rZ, rZ, r)
107
+ : C_O0_I4(rZ, rZ, r, r));
108
109
default:
110
g_assert_not_reached();
111
--
112
2.34.1
113
114
Deleted patch
1
Allocate TCG_REG_TMP2. Use R0, TMP1, TMP2 instead of any of
2
the normally allocated registers for the tlb load.
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
6
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
7
---
8
tcg/ppc/tcg-target.c.inc | 78 ++++++++++++++++++++++++----------------
9
1 file changed, 47 insertions(+), 31 deletions(-)
10
11
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target.c.inc
14
+++ b/tcg/ppc/tcg-target.c.inc
15
@@ -XXX,XX +XXX,XX @@
16
#else
17
# define TCG_REG_TMP1 TCG_REG_R12
18
#endif
19
+#define TCG_REG_TMP2 TCG_REG_R11
20
21
#define TCG_VEC_TMP1 TCG_REG_V0
22
#define TCG_VEC_TMP2 TCG_REG_V1
23
@@ -XXX,XX +XXX,XX @@ static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
24
/*
25
* For the purposes of ppc32 sorting 4 input registers into 4 argument
26
* registers, there is an outside chance we would require 3 temps.
27
- * Because of constraints, no inputs are in r3, and env will not be
28
- * placed into r3 until after the sorting is done, and is thus free.
29
*/
30
static const TCGLdstHelperParam ldst_helper_param = {
31
.ra_gen = ldst_ra_gen,
32
.ntmp = 3,
33
- .tmp = { TCG_REG_TMP1, TCG_REG_R0, TCG_REG_R3 }
34
+ .tmp = { TCG_REG_TMP1, TCG_REG_TMP2, TCG_REG_R0 }
35
};
36
37
static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
38
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
39
/* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx]. */
40
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
41
QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -32768);
42
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R3, TCG_AREG0, mask_off);
43
- tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R4, TCG_AREG0, table_off);
44
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_AREG0, mask_off);
45
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_AREG0, table_off);
46
47
/* Extract the page index, shifted into place for tlb index. */
48
if (TCG_TARGET_REG_BITS == 32) {
49
- tcg_out_shri32(s, TCG_REG_TMP1, addrlo,
50
+ tcg_out_shri32(s, TCG_REG_R0, addrlo,
51
TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
52
} else {
53
- tcg_out_shri64(s, TCG_REG_TMP1, addrlo,
54
+ tcg_out_shri64(s, TCG_REG_R0, addrlo,
55
TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
56
}
57
- tcg_out32(s, AND | SAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_TMP1));
58
+ tcg_out32(s, AND | SAB(TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_R0));
59
60
- /* Load the TLB comparator. */
61
+ /* Load the (low part) TLB comparator into TMP2. */
62
if (cmp_off == 0 && TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
63
uint32_t lxu = (TCG_TARGET_REG_BITS == 32 || TARGET_LONG_BITS == 32
64
? LWZUX : LDUX);
65
- tcg_out32(s, lxu | TAB(TCG_REG_TMP1, TCG_REG_R3, TCG_REG_R4));
66
+ tcg_out32(s, lxu | TAB(TCG_REG_TMP2, TCG_REG_TMP1, TCG_REG_TMP2));
67
} else {
68
- tcg_out32(s, ADD | TAB(TCG_REG_R3, TCG_REG_R3, TCG_REG_R4));
69
+ tcg_out32(s, ADD | TAB(TCG_REG_TMP1, TCG_REG_TMP1, TCG_REG_TMP2));
70
if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
71
- tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_TMP1, TCG_REG_R3, cmp_off + 4);
72
- tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_R4, TCG_REG_R3, cmp_off);
73
+ tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_TMP2,
74
+ TCG_REG_TMP1, cmp_off + 4 * HOST_BIG_ENDIAN);
75
} else {
76
- tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP1, TCG_REG_R3, cmp_off);
77
+ tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP2, TCG_REG_TMP1, cmp_off);
78
}
79
}
80
81
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
82
* Load the TLB addend for use on the fast path.
83
* Do this asap to minimize any load use delay.
84
*/
85
- h->base = TCG_REG_R3;
86
- tcg_out_ld(s, TCG_TYPE_PTR, h->base, TCG_REG_R3,
87
- offsetof(CPUTLBEntry, addend));
88
+ if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
89
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
90
+ offsetof(CPUTLBEntry, addend));
91
+ }
92
93
- /* Clear the non-page, non-alignment bits from the address */
94
+ /* Clear the non-page, non-alignment bits from the address in R0. */
95
if (TCG_TARGET_REG_BITS == 32) {
96
/*
97
* We don't support unaligned accesses on 32-bits.
98
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
99
if (TARGET_LONG_BITS == 32) {
100
tcg_out_rlw(s, RLWINM, TCG_REG_R0, t, 0,
101
(32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
102
- /* Zero-extend the address for use in the final address. */
103
- tcg_out_ext32u(s, TCG_REG_R4, addrlo);
104
- addrlo = TCG_REG_R4;
105
} else if (a_bits == 0) {
106
tcg_out_rld(s, RLDICR, TCG_REG_R0, t, 0, 63 - TARGET_PAGE_BITS);
107
} else {
108
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
109
tcg_out_rld(s, RLDICL, TCG_REG_R0, TCG_REG_R0, TARGET_PAGE_BITS, 0);
110
}
111
}
112
- h->index = addrlo;
113
114
if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
115
- tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
116
+ /* Low part comparison into cr7. */
117
+ tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP2,
118
0, 7, TCG_TYPE_I32);
119
- tcg_out_cmp(s, TCG_COND_EQ, addrhi, TCG_REG_R4, 0, 6, TCG_TYPE_I32);
120
+
121
+ /* Load the high part TLB comparator into TMP2. */
122
+ tcg_out_ld(s, TCG_TYPE_I32, TCG_REG_TMP2, TCG_REG_TMP1,
123
+ cmp_off + 4 * !HOST_BIG_ENDIAN);
124
+
125
+ /* Load addend, deferred for this case. */
126
+ tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1,
127
+ offsetof(CPUTLBEntry, addend));
128
+
129
+ /* High part comparison into cr6. */
130
+ tcg_out_cmp(s, TCG_COND_EQ, addrhi, TCG_REG_TMP2, 0, 6, TCG_TYPE_I32);
131
+
132
+ /* Combine comparisons into cr7. */
133
tcg_out32(s, CRAND | BT(7, CR_EQ) | BA(6, CR_EQ) | BB(7, CR_EQ));
134
} else {
135
- tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP1,
136
+ /* Full comparison into cr7. */
137
+ tcg_out_cmp(s, TCG_COND_EQ, TCG_REG_R0, TCG_REG_TMP2,
138
0, 7, TCG_TYPE_TL);
139
}
140
141
/* Load a pointer into the current opcode w/conditional branch-link. */
142
ldst->label_ptr[0] = s->code_ptr;
143
tcg_out32(s, BC | BI(7, CR_EQ) | BO_COND_FALSE | LK);
144
+
145
+ h->base = TCG_REG_TMP1;
146
#else
147
if (a_bits) {
148
ldst = new_ldst_label(s);
149
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
150
}
151
152
h->base = guest_base ? TCG_GUEST_BASE_REG : 0;
153
- h->index = addrlo;
154
- if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
155
- tcg_out_ext32u(s, TCG_REG_TMP1, addrlo);
156
- h->index = TCG_REG_TMP1;
157
- }
158
#endif
159
160
+ if (TCG_TARGET_REG_BITS > TARGET_LONG_BITS) {
161
+ /* Zero-extend the guest address for use in the host address. */
162
+ tcg_out_ext32u(s, TCG_REG_R0, addrlo);
163
+ h->index = TCG_REG_R0;
164
+ } else {
165
+ h->index = addrlo;
166
+ }
167
+
168
return ldst;
169
}
170
171
@@ -XXX,XX +XXX,XX @@ static void tcg_target_init(TCGContext *s)
172
#if defined(_CALL_SYSV) || TCG_TARGET_REG_BITS == 64
173
tcg_regset_set_reg(s->reserved_regs, TCG_REG_R13); /* thread pointer */
174
#endif
175
- tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1); /* mem temp */
176
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
177
+ tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP2);
178
tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP1);
179
tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP2);
180
if (USE_REG_TB) {
181
--
182
2.34.1
183
184
Deleted patch
1
These constraints have not been used for quite some time.
2
1
3
Fixes: 77b73de67632 ("Use rem/div[u]_i32 drop div[u]2_i32")
4
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
5
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
6
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
7
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8
---
9
tcg/ppc/tcg-target-con-str.h | 4 ----
10
1 file changed, 4 deletions(-)
11
12
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
13
index XXXXXXX..XXXXXXX 100644
14
--- a/tcg/ppc/tcg-target-con-str.h
15
+++ b/tcg/ppc/tcg-target-con-str.h
16
@@ -XXX,XX +XXX,XX @@
17
*/
18
REGS('r', ALL_GENERAL_REGS)
19
REGS('v', ALL_VECTOR_REGS)
20
-REGS('A', 1u << TCG_REG_R3)
21
-REGS('B', 1u << TCG_REG_R4)
22
-REGS('C', 1u << TCG_REG_R5)
23
-REGS('D', 1u << TCG_REG_R6)
24
25
/*
26
* Define constraint letters for constants:
27
--
28
2.34.1
29
30
Deleted patch
1
Never used since its introduction.
2
1
3
Fixes: 3d582c6179c ("tcg-ppc64: Rearrange integer constant constraints")
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
7
tcg/ppc/tcg-target-con-str.h | 1 -
8
tcg/ppc/tcg-target.c.inc | 3 ---
9
2 files changed, 4 deletions(-)
10
11
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
12
index XXXXXXX..XXXXXXX 100644
13
--- a/tcg/ppc/tcg-target-con-str.h
14
+++ b/tcg/ppc/tcg-target-con-str.h
15
@@ -XXX,XX +XXX,XX @@ REGS('v', ALL_VECTOR_REGS)
16
* CONST(letter, TCG_CT_CONST_* bit set)
17
*/
18
CONST('I', TCG_CT_CONST_S16)
19
-CONST('J', TCG_CT_CONST_U16)
20
CONST('M', TCG_CT_CONST_MONE)
21
CONST('T', TCG_CT_CONST_S32)
22
CONST('U', TCG_CT_CONST_U32)
23
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
24
index XXXXXXX..XXXXXXX 100644
25
--- a/tcg/ppc/tcg-target.c.inc
26
+++ b/tcg/ppc/tcg-target.c.inc
27
@@ -XXX,XX +XXX,XX @@
28
#define SZR (TCG_TARGET_REG_BITS / 8)
29
30
#define TCG_CT_CONST_S16 0x100
31
-#define TCG_CT_CONST_U16 0x200
32
#define TCG_CT_CONST_S32 0x400
33
#define TCG_CT_CONST_U32 0x800
34
#define TCG_CT_CONST_ZERO 0x1000
35
@@ -XXX,XX +XXX,XX @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
36
37
if ((ct & TCG_CT_CONST_S16) && val == (int16_t)val) {
38
return 1;
39
- } else if ((ct & TCG_CT_CONST_U16) && val == (uint16_t)val) {
40
- return 1;
41
} else if ((ct & TCG_CT_CONST_S32) && val == (int32_t)val) {
42
return 1;
43
} else if ((ct & TCG_CT_CONST_U32) && val == (uint32_t)val) {
44
--
45
2.34.1
46
47
Deleted patch
1
Rather than zero-extend the guest address into a register,
2
use an add instruction which zero-extends the second input.
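
A rough C model of the ALGFR (add logical, 64 <- 32) semantics relied on
here, as I understand them: the 32-bit second operand is zero-extended as
part of the add, which is what makes a separate zero-extension unnecessary.

    #include <inttypes.h>
    #include <stdio.h>

    /*
     * Rough model of s390x ALGFR r1,r2: the 32-bit second operand is
     * zero-extended and added to the 64-bit first operand.  (Condition
     * code effects are ignored here.)
     */
    static uint64_t algfr(uint64_t r1, uint32_t r2)
    {
        return r1 + (uint64_t)r2;
    }

    int main(void)
    {
        uint64_t tlb_addend = 0x7f0000000000ull;
        uint32_t guest_addr = 0xfffff000u;  /* would sign-extend if treated as int */

        printf("host address = %#" PRIx64 "\n", algfr(tlb_addend, guest_addr));
        return 0;
    }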
3
1
4
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
5
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
6
---
7
tcg/s390x/tcg-target.c.inc | 8 +++++---
8
1 file changed, 5 insertions(+), 3 deletions(-)
9
10
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
11
index XXXXXXX..XXXXXXX 100644
12
--- a/tcg/s390x/tcg-target.c.inc
13
+++ b/tcg/s390x/tcg-target.c.inc
14
@@ -XXX,XX +XXX,XX @@ typedef enum S390Opcode {
15
RRE_ALGR = 0xb90a,
16
RRE_ALCR = 0xb998,
17
RRE_ALCGR = 0xb988,
18
+ RRE_ALGFR = 0xb91a,
19
RRE_CGR = 0xb920,
20
RRE_CLGR = 0xb921,
21
RRE_DLGR = 0xb987,
22
@@ -XXX,XX +XXX,XX @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
23
tcg_out_insn(s, RXY, LG, h->index, TCG_REG_R2, TCG_REG_NONE,
24
offsetof(CPUTLBEntry, addend));
25
26
- h->base = addr_reg;
27
if (TARGET_LONG_BITS == 32) {
28
- tcg_out_ext32u(s, TCG_REG_R3, addr_reg);
29
- h->base = TCG_REG_R3;
30
+ tcg_out_insn(s, RRE, ALGFR, h->index, addr_reg);
31
+ h->base = TCG_REG_NONE;
32
+ } else {
33
+ h->base = addr_reg;
34
}
35
h->disp = 0;
36
#else
37
--
38
2.34.1
39
40
Deleted patch

The opposite of MO_UNALN is MO_ALIGN.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/mips/tcg/nanomips_translate.c.inc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/mips/tcg/nanomips_translate.c.inc b/target/mips/tcg/nanomips_translate.c.inc
index XXXXXXX..XXXXXXX 100644
--- a/target/mips/tcg/nanomips_translate.c.inc
+++ b/target/mips/tcg/nanomips_translate.c.inc
@@ -XXX,XX +XXX,XX @@ static int decode_nanomips_32_48_opc(CPUMIPSState *env, DisasContext *ctx)
TCGv va = tcg_temp_new();
TCGv t1 = tcg_temp_new();
MemOp memop = (extract32(ctx->opcode, 8, 3)) ==
- NM_P_LS_UAWM ? MO_UNALN : 0;
+ NM_P_LS_UAWM ? MO_UNALN : MO_ALIGN;

count = (count == 0) ? 8 : count;
while (counter != count) {
--
2.34.1
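
The one-liner above is the whole point of the change: the aligned alternative to MO_UNALN is spelled MO_ALIGN rather than 0, so the alignment requirement is stated explicitly in the MemOp. A small illustrative sketch of that flag selection (invented flag values and function name, not QEMU's MemOp encoding):

    #include <stdio.h>

    /* Invented values for illustration; QEMU's MemOp encoding differs. */
    enum {
        MO_ALIGN = 1 << 0,   /* access must be naturally aligned */
        MO_UNALN = 1 << 1    /* access may be unaligned */
    };

    /* Choose the alignment flag for an (un)aligned opcode pair:
     * returning 0 would leave the requirement implicit, while
     * MO_ALIGN states it explicitly. */
    static int pick_align_flag(int is_unaligned_variant)
    {
        return is_unaligned_variant ? MO_UNALN : MO_ALIGN;
    }

    int main(void)
    {
        printf("aligned form:   flag=%d\n", pick_align_flag(0));
        printf("unaligned form: flag=%d\n", pick_align_flag(1));
        return 0;
    }
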
Deleted patch

Mark with MO_ALIGN all memory operations that are not already marked
with MO_UNALN.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/sh4/translate.c | 102 ++++++++++++++++++++++++++---------------
1 file changed, 66 insertions(+), 36 deletions(-)

diff --git a/target/sh4/translate.c b/target/sh4/translate.c
index XXXXXXX..XXXXXXX 100644
--- a/target/sh4/translate.c
+++ b/target/sh4/translate.c
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
case 0x9000:        /* mov.w @(disp,PC),Rn */
    {
TCGv addr = tcg_constant_i32(ctx->base.pc_next + 4 + B7_0 * 2);
- tcg_gen_qemu_ld_i32(REG(B11_8), addr, ctx->memidx, MO_TESW);
+ tcg_gen_qemu_ld_i32(REG(B11_8), addr, ctx->memidx,
+ MO_TESW | MO_ALIGN);
    }
    return;
case 0xd000:        /* mov.l @(disp,PC),Rn */
    {
TCGv addr = tcg_constant_i32((ctx->base.pc_next + 4 + B7_0 * 4) & ~3);
- tcg_gen_qemu_ld_i32(REG(B11_8), addr, ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(REG(B11_8), addr, ctx->memidx,
+ MO_TESL | MO_ALIGN);
    }
    return;
case 0x7000:        /* add #imm,Rn */
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    {
     TCGv arg0, arg1;
     arg0 = tcg_temp_new();
- tcg_gen_qemu_ld_i32(arg0, REG(B7_4), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(arg0, REG(B7_4), ctx->memidx,
+ MO_TESL | MO_ALIGN);
     arg1 = tcg_temp_new();
- tcg_gen_qemu_ld_i32(arg1, REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(arg1, REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
gen_helper_macl(cpu_env, arg0, arg1);
     tcg_gen_addi_i32(REG(B7_4), REG(B7_4), 4);
     tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 4);
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    {
     TCGv arg0, arg1;
     arg0 = tcg_temp_new();
- tcg_gen_qemu_ld_i32(arg0, REG(B7_4), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(arg0, REG(B7_4), ctx->memidx,
+ MO_TESL | MO_ALIGN);
     arg1 = tcg_temp_new();
- tcg_gen_qemu_ld_i32(arg1, REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(arg1, REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
gen_helper_macw(cpu_env, arg0, arg1);
     tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 2);
     tcg_gen_addi_i32(REG(B7_4), REG(B7_4), 2);
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
if (ctx->tbflags & FPSCR_SZ) {
TCGv_i64 fp = tcg_temp_new_i64();
gen_load_fpr64(ctx, fp, XHACK(B7_4));
- tcg_gen_qemu_st_i64(fp, REG(B11_8), ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_st_i64(fp, REG(B11_8), ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
    } else {
- tcg_gen_qemu_st_i32(FREG(B7_4), REG(B11_8), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(FREG(B7_4), REG(B11_8), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
    }
    return;
case 0xf008: /* fmov @Rm,{F,D,X}Rn - FPSCR: Nothing */
    CHECK_FPU_ENABLED
if (ctx->tbflags & FPSCR_SZ) {
TCGv_i64 fp = tcg_temp_new_i64();
- tcg_gen_qemu_ld_i64(fp, REG(B7_4), ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_ld_i64(fp, REG(B7_4), ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
gen_store_fpr64(ctx, fp, XHACK(B11_8));
    } else {
- tcg_gen_qemu_ld_i32(FREG(B11_8), REG(B7_4), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_ld_i32(FREG(B11_8), REG(B7_4), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
    }
    return;
case 0xf009: /* fmov @Rm+,{F,D,X}Rn - FPSCR: Nothing */
    CHECK_FPU_ENABLED
if (ctx->tbflags & FPSCR_SZ) {
TCGv_i64 fp = tcg_temp_new_i64();
- tcg_gen_qemu_ld_i64(fp, REG(B7_4), ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_ld_i64(fp, REG(B7_4), ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
gen_store_fpr64(ctx, fp, XHACK(B11_8));
tcg_gen_addi_i32(REG(B7_4), REG(B7_4), 8);
    } else {
- tcg_gen_qemu_ld_i32(FREG(B11_8), REG(B7_4), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_ld_i32(FREG(B11_8), REG(B7_4), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
     tcg_gen_addi_i32(REG(B7_4), REG(B7_4), 4);
    }
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
TCGv_i64 fp = tcg_temp_new_i64();
gen_load_fpr64(ctx, fp, XHACK(B7_4));
tcg_gen_subi_i32(addr, REG(B11_8), 8);
- tcg_gen_qemu_st_i64(fp, addr, ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_st_i64(fp, addr, ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
} else {
tcg_gen_subi_i32(addr, REG(B11_8), 4);
- tcg_gen_qemu_st_i32(FREG(B7_4), addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(FREG(B7_4), addr, ctx->memidx,
+ MO_TEUL | MO_ALIGN);
}
tcg_gen_mov_i32(REG(B11_8), addr);
}
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
     tcg_gen_add_i32(addr, REG(B7_4), REG(0));
if (ctx->tbflags & FPSCR_SZ) {
TCGv_i64 fp = tcg_temp_new_i64();
- tcg_gen_qemu_ld_i64(fp, addr, ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_ld_i64(fp, addr, ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
gen_store_fpr64(ctx, fp, XHACK(B11_8));
     } else {
- tcg_gen_qemu_ld_i32(FREG(B11_8), addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_ld_i32(FREG(B11_8), addr, ctx->memidx,
+ MO_TEUL | MO_ALIGN);
     }
    }
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
if (ctx->tbflags & FPSCR_SZ) {
TCGv_i64 fp = tcg_temp_new_i64();
gen_load_fpr64(ctx, fp, XHACK(B7_4));
- tcg_gen_qemu_st_i64(fp, addr, ctx->memidx, MO_TEUQ);
+ tcg_gen_qemu_st_i64(fp, addr, ctx->memidx,
+ MO_TEUQ | MO_ALIGN);
     } else {
- tcg_gen_qemu_st_i32(FREG(B7_4), addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(FREG(B7_4), addr, ctx->memidx,
+ MO_TEUL | MO_ALIGN);
     }
    }
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    {
     TCGv addr = tcg_temp_new();
     tcg_gen_addi_i32(addr, cpu_gbr, B7_0 * 2);
- tcg_gen_qemu_ld_i32(REG(0), addr, ctx->memidx, MO_TESW);
+ tcg_gen_qemu_ld_i32(REG(0), addr, ctx->memidx, MO_TESW | MO_ALIGN);
    }
    return;
case 0xc600:        /* mov.l @(disp,GBR),R0 */
    {
     TCGv addr = tcg_temp_new();
     tcg_gen_addi_i32(addr, cpu_gbr, B7_0 * 4);
- tcg_gen_qemu_ld_i32(REG(0), addr, ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(REG(0), addr, ctx->memidx, MO_TESL | MO_ALIGN);
    }
    return;
case 0xc000:        /* mov.b R0,@(disp,GBR) */
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    {
     TCGv addr = tcg_temp_new();
     tcg_gen_addi_i32(addr, cpu_gbr, B7_0 * 2);
- tcg_gen_qemu_st_i32(REG(0), addr, ctx->memidx, MO_TEUW);
+ tcg_gen_qemu_st_i32(REG(0), addr, ctx->memidx, MO_TEUW | MO_ALIGN);
    }
    return;
case 0xc200:        /* mov.l R0,@(disp,GBR) */
    {
     TCGv addr = tcg_temp_new();
     tcg_gen_addi_i32(addr, cpu_gbr, B7_0 * 4);
- tcg_gen_qemu_st_i32(REG(0), addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(REG(0), addr, ctx->memidx, MO_TEUL | MO_ALIGN);
    }
    return;
case 0x8000:        /* mov.b R0,@(disp,Rn) */
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    return;
case 0x4087:        /* ldc.l @Rm+,Rn_BANK */
    CHECK_PRIVILEGED
- tcg_gen_qemu_ld_i32(ALTREG(B6_4), REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(ALTREG(B6_4), REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
    tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 4);
    return;
case 0x0082:        /* stc Rm_BANK,Rn */
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    {
     TCGv addr = tcg_temp_new();
     tcg_gen_subi_i32(addr, REG(B11_8), 4);
- tcg_gen_qemu_st_i32(ALTREG(B6_4), addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(ALTREG(B6_4), addr, ctx->memidx,
+ MO_TEUL | MO_ALIGN);
     tcg_gen_mov_i32(REG(B11_8), addr);
    }
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    CHECK_PRIVILEGED
    {
     TCGv val = tcg_temp_new();
- tcg_gen_qemu_ld_i32(val, REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(val, REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
tcg_gen_andi_i32(val, val, 0x700083f3);
gen_write_sr(val);
     tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 4);
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
TCGv val = tcg_temp_new();
     tcg_gen_subi_i32(addr, REG(B11_8), 4);
gen_read_sr(val);
- tcg_gen_qemu_st_i32(val, addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(val, addr, ctx->memidx, MO_TEUL | MO_ALIGN);
     tcg_gen_mov_i32(REG(B11_8), addr);
    }
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
return;                            \
case ldpnum:                            \
prechk                             \
- tcg_gen_qemu_ld_i32(cpu_##reg, REG(B11_8), ctx->memidx, MO_TESL); \
+ tcg_gen_qemu_ld_i32(cpu_##reg, REG(B11_8), ctx->memidx, \
+ MO_TESL | MO_ALIGN); \
tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 4);        \
return;
#define ST(reg,stnum,stpnum,prechk)        \
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
{                                \
    TCGv addr = tcg_temp_new();                \
    tcg_gen_subi_i32(addr, REG(B11_8), 4);            \
- tcg_gen_qemu_st_i32(cpu_##reg, addr, ctx->memidx, MO_TEUL); \
+ tcg_gen_qemu_st_i32(cpu_##reg, addr, ctx->memidx, \
+ MO_TEUL | MO_ALIGN); \
    tcg_gen_mov_i32(REG(B11_8), addr);            \
}                                \
return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
    CHECK_FPU_ENABLED
    {
     TCGv addr = tcg_temp_new();
- tcg_gen_qemu_ld_i32(addr, REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(addr, REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
     tcg_gen_addi_i32(REG(B11_8), REG(B11_8), 4);
gen_helper_ld_fpscr(cpu_env, addr);
ctx->base.is_jmp = DISAS_STOP;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
     tcg_gen_andi_i32(val, cpu_fpscr, 0x003fffff);
     addr = tcg_temp_new();
     tcg_gen_subi_i32(addr, REG(B11_8), 4);
- tcg_gen_qemu_st_i32(val, addr, ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(val, addr, ctx->memidx, MO_TEUL | MO_ALIGN);
     tcg_gen_mov_i32(REG(B11_8), addr);
    }
    return;
case 0x00c3:        /* movca.l R0,@Rm */
{
TCGv val = tcg_temp_new();
- tcg_gen_qemu_ld_i32(val, REG(B11_8), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_ld_i32(val, REG(B11_8), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
gen_helper_movcal(cpu_env, REG(B11_8), val);
- tcg_gen_qemu_st_i32(REG(0), REG(B11_8), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(REG(0), REG(B11_8), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
}
ctx->has_movcal = 1;
    return;
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
cpu_lock_addr, fail);
tmp = tcg_temp_new();
tcg_gen_atomic_cmpxchg_i32(tmp, REG(B11_8), cpu_lock_value,
- REG(0), ctx->memidx, MO_TEUL);
+ REG(0), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
tcg_gen_setcond_i32(TCG_COND_EQ, cpu_sr_t, tmp, cpu_lock_value);
} else {
tcg_gen_brcondi_i32(TCG_COND_EQ, cpu_lock_addr, -1, fail);
- tcg_gen_qemu_st_i32(REG(0), REG(B11_8), ctx->memidx, MO_TEUL);
+ tcg_gen_qemu_st_i32(REG(0), REG(B11_8), ctx->memidx,
+ MO_TEUL | MO_ALIGN);
tcg_gen_movi_i32(cpu_sr_t, 1);
}
tcg_gen_br(done);
@@ -XXX,XX +XXX,XX @@ static void _decode_opc(DisasContext * ctx)
if ((tb_cflags(ctx->base.tb) & CF_PARALLEL)) {
TCGv tmp = tcg_temp_new();
tcg_gen_mov_i32(tmp, REG(B11_8));
- tcg_gen_qemu_ld_i32(REG(0), REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(REG(0), REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
tcg_gen_mov_i32(cpu_lock_value, REG(0));
tcg_gen_mov_i32(cpu_lock_addr, tmp);
} else {
- tcg_gen_qemu_ld_i32(REG(0), REG(B11_8), ctx->memidx, MO_TESL);
+ tcg_gen_qemu_ld_i32(REG(0), REG(B11_8), ctx->memidx,
+ MO_TESL | MO_ALIGN);
tcg_gen_movi_i32(cpu_lock_addr, 0);
}
return;
--
2.34.1
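
With every SH4 load and store above now carrying MO_ALIGN (or MO_UNALN where unaligned access is intended), the alignment requirement lives in each access's MemOp rather than in a target-wide build flag, which is what allows the next patch to drop TARGET_ALIGNED_ONLY for sh4. A self-contained sketch of what an MO_ALIGN check amounts to (hand-rolled toy model with invented values, not QEMU's implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of a per-access alignment requirement.  With MO_ALIGN,
     * an address that is not a multiple of the access size would raise
     * the guest's alignment exception; with MO_UNALN it is accepted.
     * The flag values here are invented for the example. */
    typedef enum { MO_UNALN = 0, MO_ALIGN = 1 } AlignReq;

    static bool access_allowed(uint32_t addr, unsigned size, AlignReq req)
    {
        return req == MO_UNALN || (addr % size) == 0;
    }

    int main(void)
    {
        printf("%d\n", access_allowed(0x8c001000, 4, MO_ALIGN)); /* 1: aligned */
        printf("%d\n", access_allowed(0x8c001002, 4, MO_ALIGN)); /* 0: would fault */
        printf("%d\n", access_allowed(0x8c001002, 4, MO_UNALN)); /* 1: allowed */
        return 0;
    }
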
Deleted patch

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
configs/targets/sh4-linux-user.mak | 1 -
configs/targets/sh4-softmmu.mak | 1 -
configs/targets/sh4eb-linux-user.mak | 1 -
configs/targets/sh4eb-softmmu.mak | 1 -
4 files changed, 4 deletions(-)

diff --git a/configs/targets/sh4-linux-user.mak b/configs/targets/sh4-linux-user.mak
index XXXXXXX..XXXXXXX 100644
--- a/configs/targets/sh4-linux-user.mak
+++ b/configs/targets/sh4-linux-user.mak
@@ -XXX,XX +XXX,XX @@
TARGET_ARCH=sh4
TARGET_SYSTBL_ABI=common
TARGET_SYSTBL=syscall.tbl
-TARGET_ALIGNED_ONLY=y
TARGET_HAS_BFLT=y
diff --git a/configs/targets/sh4-softmmu.mak b/configs/targets/sh4-softmmu.mak
index XXXXXXX..XXXXXXX 100644
--- a/configs/targets/sh4-softmmu.mak
+++ b/configs/targets/sh4-softmmu.mak
@@ -1,2 +1 @@
TARGET_ARCH=sh4
-TARGET_ALIGNED_ONLY=y
diff --git a/configs/targets/sh4eb-linux-user.mak b/configs/targets/sh4eb-linux-user.mak
index XXXXXXX..XXXXXXX 100644
--- a/configs/targets/sh4eb-linux-user.mak
+++ b/configs/targets/sh4eb-linux-user.mak
@@ -XXX,XX +XXX,XX @@
TARGET_ARCH=sh4
TARGET_SYSTBL_ABI=common
TARGET_SYSTBL=syscall.tbl
-TARGET_ALIGNED_ONLY=y
TARGET_BIG_ENDIAN=y
TARGET_HAS_BFLT=y
diff --git a/configs/targets/sh4eb-softmmu.mak b/configs/targets/sh4eb-softmmu.mak
index XXXXXXX..XXXXXXX 100644
--- a/configs/targets/sh4eb-softmmu.mak
+++ b/configs/targets/sh4eb-softmmu.mak
@@ -XXX,XX +XXX,XX @@
TARGET_ARCH=sh4
-TARGET_ALIGNED_ONLY=y
TARGET_BIG_ENDIAN=y
--
2.34.1