From nobody Wed Nov 5 10:40:46 2025
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Date: Wed, 8 Aug 2018 21:21:58 -0700
Message-Id: <20180809042206.15726-13-richard.henderson@linaro.org>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20180809042206.15726-1-richard.henderson@linaro.org>
References: <20180809042206.15726-1-richard.henderson@linaro.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not 
recognized. X-Received-From: 2607:f8b0:4864:20::42d Subject: [Qemu-devel] [PATCH 12/20] target/arm: Rewrite helper_sve_ld1*_r using pages X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: laurent.desnogues@gmail.com, peter.maydell@linaro.org, alex.bennee@linaro.org Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org Sender: "Qemu-devel" X-ZohoMail-DKIM: fail (Header signature does not verify) X-ZohoMail: RDMRC_1 RDKM_2 RSF_0 Z_629925259 SPT_0 Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Uses tlb_vaddr_to_host for correct operation with softmmu. Optimize for accesses within a single page or pair of pages. Perf report comparison for cortex-strings test-strlen with aarch64-linux-user: before: 1.59% qemu-aarch64 qemu-aarch64 [.] do_sve_ld1bb_r 0.86% qemu-aarch64 qemu-aarch64 [.] do_sve_ldff1bb_r after: 0.09% qemu-aarch64 qemu-aarch64 [.] helper_sve_ldff1bb_r 0.01% qemu-aarch64 qemu-aarch64 [.] sve_ld1bb_host Signed-off-by: Richard Henderson --- target/arm/sve_helper.c | 839 ++++++++++++++++++++++++++++++++-------- 1 file changed, 675 insertions(+), 164 deletions(-) diff --git a/target/arm/sve_helper.c b/target/arm/sve_helper.c index e03f954a26..4ca9412e20 100644 --- a/target/arm/sve_helper.c +++ b/target/arm/sve_helper.c @@ -1688,6 +1688,45 @@ static void swap_memmove(void *vd, void *vs, size_t = n) } } =20 +/* Similarly for memset of 0. */ +static void swap_memzero(void *vd, size_t n) +{ + uintptr_t d =3D (uintptr_t)vd; + uintptr_t o =3D (d | n) & 7; + size_t i; + + if (likely(n =3D=3D 0)) { + return; + } +#ifndef HOST_WORDS_BIGENDIAN + o =3D 0; +#endif + switch (o) { + case 0: + memset(vd, 0, n); + break; + + case 4: + for (i =3D 0; i < n; i +=3D 4) { + *(uint32_t *)H1_4(d + i) =3D 0; + } + break; + + case 2: + case 6: + for (i =3D 0; i < n; i +=3D 2) { + *(uint16_t *)H1_2(d + i) =3D 0; + } + break; + + default: + for (i =3D 0; i < n; i++) { + *(uint8_t *)H1(d + i) =3D 0; + } + break; + } +} + void HELPER(sve_ext)(void *vd, void *vn, void *vm, uint32_t desc) { intptr_t opr_sz =3D simd_oprsz(desc); @@ -3927,32 +3966,438 @@ void HELPER(sve_fcmla_zpzzz_d)(CPUARMState *env, v= oid *vg, uint32_t desc) /* * Load contiguous data, protected by a governing predicate. */ -#define DO_LD1(NAME, FN, TYPEE, TYPEM, H) \ -static void do_##NAME(CPUARMState *env, void *vd, void *vg, \ - target_ulong addr, intptr_t oprsz, \ - uintptr_t ra) \ -{ \ - intptr_t i =3D 0; \ - do { \ - uint16_t pg =3D *(uint16_t *)(vg + H1_2(i >> 3)); \ - do { \ - TYPEM m =3D 0; \ - if (pg & 1) { \ - m =3D FN(env, addr, ra); \ - } \ - *(TYPEE *)(vd + H(i)) =3D m; \ - i +=3D sizeof(TYPEE), pg >>=3D sizeof(TYPEE); \ - addr +=3D sizeof(TYPEM); \ - } while (i & 15); \ - } while (i < oprsz); \ -} \ -void HELPER(NAME)(CPUARMState *env, void *vg, \ - target_ulong addr, uint32_t desc) \ -{ \ - do_##NAME(env, &env->vfp.zregs[simd_data(desc)], vg, \ - addr, simd_oprsz(desc), GETPC()); \ + +/* Load elements into VD, controlled by VG, from HOST+MEM_OFS. + * Memory is valid through MEM_MAX. The register element indicies + * are inferred from MEM_OFS, as modified by the types for which + * the helper is built. Return the MEM_OFS of the first element + * not loaded (which is MEM_MAX if they are all loaded). + * + * For softmmu, we have fully validated the guest page. 
For user-only, + * we cannot fully validate without taking the mmap lock, but since we + * know the access is within one host page, if any access is valid they + * all must be valid. However, it may be that no access is valid and + * they have all been predicated false. + */ +typedef intptr_t sve_ld1_host_fn(void *vd, void *vg, void *host, + intptr_t mem_ofs, intptr_t mem_max); + +/* Load one element into VD+REG_OFF from (ENV,VADDR,RA). + * The controlling predicate is known to be true. + */ +typedef void sve_ld1_tlb_fn(CPUARMState *env, void *vd, intptr_t reg_off, + target_ulong vaddr, int mmu_idx, uintptr_t ra); + +/* + * Generate the above primitives. + */ + +#define DO_LD_HOST(NAME, H, TYPEE, TYPEM, HOST) \ +static intptr_t sve_##NAME##_host(void *vd, void *vg, void *host, = \ + intptr_t mem_off, const intptr_t mem_max= ) \ +{ = \ + intptr_t reg_off =3D mem_off * (sizeof(TYPEE) / sizeof(TYPEM)); = \ + uint64_t *pg =3D vg; = \ + while (mem_off + sizeof(TYPEM) <=3D mem_max) { = \ + TYPEM val =3D 0; = \ + if (likely((pg[reg_off >> 6] >> (reg_off & 63)) & 1)) { = \ + val =3D HOST(host + mem_off); = \ + } = \ + *(TYPEE *)(vd + H(reg_off)) =3D val; = \ + mem_off +=3D sizeof(TYPEM), reg_off +=3D sizeof(TYPEE); = \ + } = \ + return mem_off; = \ } =20 +#ifdef CONFIG_SOFTMMU +#define DO_LD_TLB(NAME, H, TYPEE, TYPEM, HOST, MOEND, TLB) \ +static void sve_##NAME##_tlb(CPUARMState *env, void *vd, intptr_t reg_off,= \ + target_ulong addr, int mmu_idx, uintptr_t ra)= \ +{ = \ + TCGMemOpIdx oi =3D make_memop_idx(ctz32(sizeof(TYPEM)) | MOEND, mmu_id= x); \ + TYPEM val =3D TLB(env, addr, oi, ra); = \ + *(TYPEE *)(vd + H(reg_off)) =3D val; = \ +} +#else +#define DO_LD_TLB(NAME, H, TYPEE, TYPEM, HOST, MOEND, TLB) = \ +static void sve_##NAME##_tlb(CPUARMState *env, void *vd, intptr_t reg_off,= \ + target_ulong addr, int mmu_idx, uintptr_t ra)= \ +{ = \ + TYPEM val =3D HOST(g2h(addr)); = \ + *(TYPEE *)(vd + H(reg_off)) =3D val; = \ +} +#endif + +DO_LD_TLB(ld1bb, H1, uint8_t, uint8_t, ldub_p, 0, helper_ret_ldub_mmu) + +#define DO_LD_PRIM_1(NAME, H, TE, TM) \ + DO_LD_HOST(NAME, H, TE, TM, ldub_p) \ + DO_LD_TLB(NAME, H, TE, TM, ldub_p, 0, helper_ret_ldub_mmu) + +DO_LD_PRIM_1(ld1bhu, H1_2, uint16_t, uint8_t) +DO_LD_PRIM_1(ld1bhs, H1_2, uint16_t, int8_t) +DO_LD_PRIM_1(ld1bsu, H1_4, uint32_t, uint8_t) +DO_LD_PRIM_1(ld1bss, H1_4, uint32_t, int8_t) +DO_LD_PRIM_1(ld1bdu, , uint64_t, uint8_t) +DO_LD_PRIM_1(ld1bds, , uint64_t, int8_t) + +#define DO_LD_PRIM_2(NAME, end, MOEND, H, TE, TM, PH, PT) \ + DO_LD_HOST(NAME##_##end, H, TE, TM, PH##_##end##_p) \ + DO_LD_TLB(NAME##_##end, H, TE, TM, PH##_##end##_p, \ + MOEND, helper_##end##_##PT##_mmu) + +DO_LD_PRIM_2(ld1hh, le, MO_LE, H1_2, uint16_t, uint16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hsu, le, MO_LE, H1_4, uint32_t, uint16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hss, le, MO_LE, H1_4, uint32_t, int16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hdu, le, MO_LE, , uint64_t, uint16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hds, le, MO_LE, , uint64_t, int16_t, lduw, lduw) + +DO_LD_PRIM_2(ld1ss, le, MO_LE, H1_4, uint32_t, uint32_t, ldl, ldul) +DO_LD_PRIM_2(ld1sdu, le, MO_LE, , uint64_t, uint32_t, ldl, ldul) +DO_LD_PRIM_2(ld1sds, le, MO_LE, , uint64_t, int32_t, ldl, ldul) + +DO_LD_PRIM_2(ld1dd, le, MO_LE, , uint64_t, uint64_t, ldq, ldq) + +DO_LD_PRIM_2(ld1hh, be, MO_BE, H1_2, uint16_t, uint16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hsu, be, MO_BE, H1_4, uint32_t, uint16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hss, be, MO_BE, H1_4, uint32_t, int16_t, lduw, lduw) +DO_LD_PRIM_2(ld1hdu, be, MO_BE, , uint64_t, uint16_t, lduw, lduw) 
+DO_LD_PRIM_2(ld1hds, be, MO_BE, , uint64_t, int16_t, lduw, lduw) + +DO_LD_PRIM_2(ld1ss, be, MO_BE, H1_4, uint32_t, uint32_t, ldl, ldul) +DO_LD_PRIM_2(ld1sdu, be, MO_BE, , uint64_t, uint32_t, ldl, ldul) +DO_LD_PRIM_2(ld1sds, be, MO_BE, , uint64_t, int32_t, ldl, ldul) + +DO_LD_PRIM_2(ld1dd, be, MO_BE, , uint64_t, uint64_t, ldq, ldq) + +#undef DO_LD_TLB +#undef DO_LD_HOST +#undef DO_LD_PRIM_1 +#undef DO_LD_PRIM_2 + +/* + * Special case contiguous loads of bytes to accellerate strings. + * + * The assumption is that the governing predicate will be mostly true. + * When it is not all true, it has been set by whilelo and so has a + * block of true elements followed by a block of false elements. + * Thus anything we can do to handle as many bytes as possible in one + * step will pay dividends. + * + * Because of how vector registers are represented in CPUARMState, + * each block of 8 can be read with a little-endian load to be stored + * into the vector register in host-endian order. + * + * TODO: For LE host and LE guest (by far the most common combination), + * the only difference for other non-extending loads is the controlling + * predicate. Even for other combinations, it might be fastest to use + * this primitive to block load all of the data and then reorder the + * bytes afterward. + */ + +/* For user-only, conditionally load and mask from HOST, returning 0 + * if the predicate is false. This is required because, as described + * above, we have not fully validated the page, and faults are not + * permitted when the predicate is false. + * For softmmu, we never arrive here with invalid host memory; just mask. + */ +static inline uint64_t ldq_le_pred_b(uint8_t pg, void *host) +{ +#ifdef CONFIG_USER_ONLY + if (pg =3D=3D 0) { + return 0; + } +#endif + return ldq_le_p(host) & expand_pred_b(pg); +} + +static inline uint8_t ldub_pred(uint8_t pg, void *host) +{ +#ifdef CONFIG_USER_ONLY + return pg & 1 ? ldub_p(host) : 0; +#else + return ldub_p(host) & -(pg & 1); +#endif +} + +static intptr_t sve_ld1bb_host(void *vd, void *vg, void *host, + intptr_t off, const intptr_t max) +{ + uint64_t *d =3D vd; + uint8_t *g =3D vg; + + /* Assuming OFF and MAX may be misaligned, but also the most common + * case is an entire vector register: OFF =3D=3D 0, MAX % 16 =3D=3D 0. + */ + if (likely(off + 8 <=3D max)) { + const intptr_t max_div_8 =3D max >> 3; + intptr_t off_div_8 =3D off >> 3; + uint64_t data; + + if (unlikely(off & 63)) { + /* Align for a loop-of-8. We know from the range check + * above that we have enough remaining to load 8 bytes. + */ + if (unlikely(off & 7)) { + int off_7 =3D off & 7; + uint8_t pg =3D g[H1(off_div_8)] >> off_7; + + off_7 *=3D 8; + data =3D ldq_le_pred_b(pg, host + off); + data =3D deposit64(d[off_div_8], off_7, 64 - off_7, data); + d[off_div_8] =3D data; + + off_div_8 +=3D 1; + } + + /* If there are not sufficient bytes to align for 64 + * and also execute that loop at least once, skip to tail. + */ + if (ROUND_UP(off_div_8, 8) + 8 > max_div_8) { + goto skip_64; + } + + /* Align for the loop-of-64. */ + if (unlikely(off_div_8 & 7)) { + do { + uint8_t pg =3D g[off_div_8]; + data =3D ldq_le_pred_b(pg, host + off_div_8 * 8); + d[off_div_8] =3D data; + } while (++off_div_8 & 7); + } + } + + /* While we have blocks of 64 remaining, we can perform tests + * against large blocks of predicates at once. 
+ */ + for (; off_div_8 + 8 <=3D max_div_8; off_div_8 +=3D 8) { + uint64_t pg =3D *(uint64_t *)(g + off_div_8); + if (likely(pg =3D=3D -1ULL)) { +#ifndef HOST_WORDS_BIGENDIAN + memcpy(d + off_div_8, host + off_div_8 * 8, 64); +#else + intptr_t j; + for (j =3D 0; j < 8; j++) { + data =3D ldq_le_p(host + (off_div_8 + j) * 8); + d[off_div_8 + j] =3D data; + } +#endif + } else if (pg =3D=3D 0) { + memset(d + off_div_8, 0, 64); + } else { + intptr_t j; + for (j =3D 0; j < 8; j++) { + data =3D ldq_le_pred_b(pg >> (j * 8), + host + (off_div_8 + j) * 8); + d[off_div_8 + j] =3D data; + } + } + } + + skip_64: + /* Final tail or a copy smaller than 64 bytes. */ + for (; off_div_8 < max_div_8; off_div_8++) { + uint8_t pg =3D g[H1(off_div_8)]; + data =3D ldq_le_pred_b(pg, host + off_div_8 * 8); + d[off_div_8] =3D data; + } + + /* Restore using OFF. */ + off =3D off_div_8 * 8; + } + + /* Final tail or a really small copy. */ + if (unlikely(off < max)) { + do { + uint8_t pg =3D g[H1(off >> 3)] >> (off & 7); + ((uint8_t *)vd)[H1(off)] =3D ldub_pred(pg, host + off); + } while (++off < max); + } + + return max; +} + +/* Skip through a sequence of inactive elements in the guarding predicate = VG, + * beginning at REG_OFF bounded by REG_MAX. Return the offset of the acti= ve + * element >=3D REG_OFF, or REG_MAX if there were no active elements at al= l. + */ +static intptr_t find_next_active(uint64_t *vg, intptr_t reg_off, + intptr_t reg_max, int esz) +{ + uint64_t pg_mask =3D pred_esz_masks[esz]; + uint64_t pg =3D (vg[reg_off >> 6] & pg_mask) >> (reg_off & 63); + + /* In normal usage, the first element is active. */ + if (likely(pg & 1)) { + return reg_off; + } + + if (pg =3D=3D 0) { + reg_off &=3D -64; + do { + reg_off +=3D 64; + if (unlikely(reg_off >=3D reg_max)) { + /* The entire predicate was false. */ + return reg_max; + } + pg =3D vg[reg_off >> 6] & pg_mask; + } while (pg =3D=3D 0); + } + reg_off +=3D ctz64(pg); + + /* We should never see an out of range predicate bit set. */ + tcg_debug_assert(reg_off < reg_max); + return reg_off; +} + +/* Return the maximum offset <=3D MEM_MAX which is still within the page + * referenced by BASE+MEM_OFF. + */ +static intptr_t max_for_page(target_ulong base, intptr_t mem_off, + intptr_t mem_max) +{ + target_ulong addr =3D base + mem_off; + intptr_t split =3D -(intptr_t)(addr | TARGET_PAGE_MASK); + return MIN(split, mem_max - mem_off) + mem_off; +} + +static inline void set_helper_retaddr(uintptr_t ra) +{ +#ifdef CONFIG_USER_ONLY + helper_retaddr =3D ra; +#endif +} + +static inline bool test_host_page(void *host) +{ +#ifdef CONFIG_USER_ONLY + return true; +#else + return likely(host !=3D NULL); +#endif +} + +/* + * Common helper for all contiguous one-register predicated loads. + */ +static void sve_ld1_r(CPUARMState *env, void *vg, const target_ulong addr, + uint32_t desc, const uintptr_t retaddr, + const int esz, const int msz, + sve_ld1_host_fn *host_fn, + sve_ld1_tlb_fn *tlb_fn) +{ + void *vd =3D &env->vfp.zregs[simd_data(desc)]; + const int diffsz =3D esz - msz; + const intptr_t reg_max =3D simd_oprsz(desc); + const intptr_t mem_max =3D reg_max >> diffsz; + const int mmu_idx =3D cpu_mmu_index(env, false); + ARMVectorReg scratch; + void *host, *result; + intptr_t split; + + set_helper_retaddr(retaddr); + + host =3D tlb_vaddr_to_host(env, addr, MMU_DATA_LOAD, mmu_idx); + if (test_host_page(host)) { + split =3D max_for_page(addr, 0, mem_max); + if (likely(split =3D=3D mem_max)) { + /* The load is entirely within a valid page. For softmmu, + * no faults. 
For user-only, if the first byte does not + * fault then none of them will fault, so Vd will never be + * partially modified. + */ + host_fn(vd, vg, host, 0, mem_max); + set_helper_retaddr(0); + return; + } + } + + /* Perform the predicated read into a temporary, thus ensuring + * if the load of the last element faults, Vd is not modified. + */ + result =3D &scratch; +#ifdef CONFIG_USER_ONLY + host_fn(vd, vg, host, 0, mem_max); +#else + memset(result, 0, reg_max); + for (intptr_t reg_off =3D find_next_active(vg, 0, reg_max, esz); + reg_off < reg_max; + reg_off =3D find_next_active(vg, reg_off, reg_max, esz)) { + intptr_t mem_off =3D reg_off >> diffsz; + + split =3D max_for_page(addr, mem_off, mem_max); + if (msz =3D=3D 0 || split - mem_off >=3D (1 << msz)) { + /* At least one whole element on this page. */ + host =3D tlb_vaddr_to_host(env, addr + mem_off, + MMU_DATA_LOAD, mmu_idx); + if (host) { + mem_off =3D host_fn(result, vg, host - mem_off, mem_off, s= plit); + reg_off =3D mem_off << diffsz; + continue; + } + } + + /* Perform one normal read. This may fault, longjmping out to the + * main loop in order to raise an exception. It may succeed, and + * as a side-effect load the TLB entry for the next round. Finall= y, + * in the extremely unlikely case we're performing this operation + * on I/O memory, it may succeed but not bring in the TLB entry. + * But even then we have still made forward progress. + */ + tlb_fn(env, result, reg_off, addr + mem_off, mmu_idx, retaddr); + reg_off +=3D 1 << esz; + } +#endif + + set_helper_retaddr(0); + memcpy(vd, result, reg_max); +} + +#define DO_LD1_1(NAME, ESZ) \ +void HELPER(sve_##NAME##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + sve_ld1_r(env, vg, addr, desc, GETPC(), ESZ, 0, \ + sve_##NAME##_host, sve_##NAME##_tlb); \ +} + +/* TODO: Propagate the endian check back to the translator. 
*/ +#define DO_LD1_2(NAME, ESZ, MSZ) \ +void HELPER(sve_##NAME##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + if (arm_cpu_data_is_big_endian(env)) { \ + sve_ld1_r(env, vg, addr, desc, GETPC(), ESZ, MSZ, \ + sve_##NAME##_be_host, sve_##NAME##_be_tlb); \ + } else { \ + sve_ld1_r(env, vg, addr, desc, GETPC(), ESZ, MSZ, \ + sve_##NAME##_le_host, sve_##NAME##_le_tlb); \ + } \ +} + +DO_LD1_1(ld1bb, 0) +DO_LD1_1(ld1bhu, 1) +DO_LD1_1(ld1bhs, 1) +DO_LD1_1(ld1bsu, 2) +DO_LD1_1(ld1bss, 2) +DO_LD1_1(ld1bdu, 3) +DO_LD1_1(ld1bds, 3) + +DO_LD1_2(ld1hh, 1, 1) +DO_LD1_2(ld1hsu, 2, 1) +DO_LD1_2(ld1hss, 2, 1) +DO_LD1_2(ld1hdu, 3, 1) +DO_LD1_2(ld1hds, 3, 1) + +DO_LD1_2(ld1ss, 2, 2) +DO_LD1_2(ld1sdu, 3, 2) +DO_LD1_2(ld1sds, 3, 2) + +DO_LD1_2(ld1dd, 3, 3) + +#undef DO_LD1_1 +#undef DO_LD1_2 + #define DO_LD2(NAME, FN, TYPEE, TYPEM, H) \ void HELPER(NAME)(CPUARMState *env, void *vg, \ target_ulong addr, uint32_t desc) \ @@ -4037,52 +4482,40 @@ void HELPER(NAME)(CPUARMState *env, void *vg, = \ } \ } =20 -DO_LD1(sve_ld1bhu_r, cpu_ldub_data_ra, uint16_t, uint8_t, H1_2) -DO_LD1(sve_ld1bhs_r, cpu_ldsb_data_ra, uint16_t, int8_t, H1_2) -DO_LD1(sve_ld1bsu_r, cpu_ldub_data_ra, uint32_t, uint8_t, H1_4) -DO_LD1(sve_ld1bss_r, cpu_ldsb_data_ra, uint32_t, int8_t, H1_4) -DO_LD1(sve_ld1bdu_r, cpu_ldub_data_ra, uint64_t, uint8_t, ) -DO_LD1(sve_ld1bds_r, cpu_ldsb_data_ra, uint64_t, int8_t, ) - -DO_LD1(sve_ld1hsu_r, cpu_lduw_data_ra, uint32_t, uint16_t, H1_4) -DO_LD1(sve_ld1hss_r, cpu_ldsw_data_ra, uint32_t, int16_t, H1_4) -DO_LD1(sve_ld1hdu_r, cpu_lduw_data_ra, uint64_t, uint16_t, ) -DO_LD1(sve_ld1hds_r, cpu_ldsw_data_ra, uint64_t, int16_t, ) - -DO_LD1(sve_ld1sdu_r, cpu_ldl_data_ra, uint64_t, uint32_t, ) -DO_LD1(sve_ld1sds_r, cpu_ldl_data_ra, uint64_t, int32_t, ) - -DO_LD1(sve_ld1bb_r, cpu_ldub_data_ra, uint8_t, uint8_t, H1) DO_LD2(sve_ld2bb_r, cpu_ldub_data_ra, uint8_t, uint8_t, H1) DO_LD3(sve_ld3bb_r, cpu_ldub_data_ra, uint8_t, uint8_t, H1) DO_LD4(sve_ld4bb_r, cpu_ldub_data_ra, uint8_t, uint8_t, H1) =20 -DO_LD1(sve_ld1hh_r, cpu_lduw_data_ra, uint16_t, uint16_t, H1_2) DO_LD2(sve_ld2hh_r, cpu_lduw_data_ra, uint16_t, uint16_t, H1_2) DO_LD3(sve_ld3hh_r, cpu_lduw_data_ra, uint16_t, uint16_t, H1_2) DO_LD4(sve_ld4hh_r, cpu_lduw_data_ra, uint16_t, uint16_t, H1_2) =20 -DO_LD1(sve_ld1ss_r, cpu_ldl_data_ra, uint32_t, uint32_t, H1_4) DO_LD2(sve_ld2ss_r, cpu_ldl_data_ra, uint32_t, uint32_t, H1_4) DO_LD3(sve_ld3ss_r, cpu_ldl_data_ra, uint32_t, uint32_t, H1_4) DO_LD4(sve_ld4ss_r, cpu_ldl_data_ra, uint32_t, uint32_t, H1_4) =20 -DO_LD1(sve_ld1dd_r, cpu_ldq_data_ra, uint64_t, uint64_t, ) DO_LD2(sve_ld2dd_r, cpu_ldq_data_ra, uint64_t, uint64_t, ) DO_LD3(sve_ld3dd_r, cpu_ldq_data_ra, uint64_t, uint64_t, ) DO_LD4(sve_ld4dd_r, cpu_ldq_data_ra, uint64_t, uint64_t, ) =20 -#undef DO_LD1 #undef DO_LD2 #undef DO_LD3 #undef DO_LD4 =20 /* * Load contiguous data, first-fault and no-fault. + * + * For user-only, one could argue that we should hold the mmap_lock during + * the operation so that there is no race between page_check_range and the + * load operation. However, unmapping pages out from under operating thre= ad + * is extrodinarily unlikely. This theoretical race condition also affects + * linux-user/ in its get_user/put_user macros. + * + * TODO: Construct some helpers, written in assembly, that interact with + * handle_cpu_signal to produce memory ops which can properly report errors + * without racing. */ =20 -#ifdef CONFIG_USER_ONLY - /* Fault on byte I. All bits in FFR from I are cleared. 
The vector * result from I is CONSTRAINED UNPREDICTABLE; we choose the MERGE * option, which leaves subsequent data unchanged. @@ -4092,147 +4525,225 @@ static void record_fault(CPUARMState *env, uintpt= r_t i, uintptr_t oprsz) uint64_t *ffr =3D env->vfp.pregs[FFR_PRED_NUM].p; =20 if (i & 63) { - ffr[i / 64] &=3D MAKE_64BIT_MASK(0, i & 63); + ffr[i >> 6] &=3D MAKE_64BIT_MASK(0, i & 63); i =3D ROUND_UP(i, 64); } for (; i < oprsz; i +=3D 64) { - ffr[i / 64] =3D 0; + ffr[i >> 6] =3D 0; } } =20 -/* Hold the mmap lock during the operation so that there is no race - * between page_check_range and the load operation. We expect the - * usual case to have no faults at all, so we check the whole range - * first and if successful defer to the normal load operation. - * - * TODO: Change mmap_lock to a rwlock so that multiple readers - * can run simultaneously. This will probably help other uses - * within QEMU as well. +/* + * Common helper for all contiguous first-fault loads. */ -#define DO_LDFF1(PART, FN, TYPEE, TYPEM, H) \ -static void do_sve_ldff1##PART(CPUARMState *env, void *vd, void *vg, \ - target_ulong addr, intptr_t oprsz, \ - bool first, uintptr_t ra) \ -{ \ - intptr_t i =3D 0; \ - do { \ - uint16_t pg =3D *(uint16_t *)(vg + H1_2(i >> 3)); \ - do { \ - TYPEM m =3D 0; \ - if (pg & 1) { \ - if (!first && \ - unlikely(page_check_range(addr, sizeof(TYPEM), \ - PAGE_READ))) { \ - record_fault(env, i, oprsz); \ - return; \ - } \ - m =3D FN(env, addr, ra); \ - first =3D false; \ - } \ - *(TYPEE *)(vd + H(i)) =3D m; \ - i +=3D sizeof(TYPEE), pg >>=3D sizeof(TYPEE); = \ - addr +=3D sizeof(TYPEM); \ - } while (i & 15); \ - } while (i < oprsz); \ -} \ -void HELPER(sve_ldff1##PART)(CPUARMState *env, void *vg, \ - target_ulong addr, uint32_t desc) \ -{ \ - intptr_t oprsz =3D simd_oprsz(desc); \ - unsigned rd =3D simd_data(desc); \ - void *vd =3D &env->vfp.zregs[rd]; \ - mmap_lock(); \ - if (likely(page_check_range(addr, oprsz, PAGE_READ) =3D=3D 0)) { = \ - do_sve_ld1##PART(env, vd, vg, addr, oprsz, GETPC()); \ - } else { \ - do_sve_ldff1##PART(env, vd, vg, addr, oprsz, true, GETPC()); \ - } \ - mmap_unlock(); \ -} +static void sve_ldff1_r(CPUARMState *env, void *vg, const target_ulong add= r, + uint32_t desc, const uintptr_t retaddr, + const int esz, const int msz, + sve_ld1_host_fn *host_fn, + sve_ld1_tlb_fn *tlb_fn) +{ + void *vd =3D &env->vfp.zregs[simd_data(desc)]; + const int diffsz =3D esz - msz; + const intptr_t reg_max =3D simd_oprsz(desc); + const intptr_t mem_max =3D reg_max >> diffsz; + const int mmu_idx =3D cpu_mmu_index(env, false); + intptr_t split, reg_off, mem_off; + void *host; =20 -/* No-fault loads are like first-fault loads without the - * first faulting special case. - */ -#define DO_LDNF1(PART) \ -void HELPER(sve_ldnf1##PART)(CPUARMState *env, void *vg, \ - target_ulong addr, uint32_t desc) \ -{ \ - intptr_t oprsz =3D simd_oprsz(desc); \ - unsigned rd =3D simd_data(desc); \ - void *vd =3D &env->vfp.zregs[rd]; \ - mmap_lock(); \ - if (likely(page_check_range(addr, oprsz, PAGE_READ) =3D=3D 0)) { = \ - do_sve_ld1##PART(env, vd, vg, addr, oprsz, GETPC()); \ - } else { \ - do_sve_ldff1##PART(env, vd, vg, addr, oprsz, false, GETPC()); \ - } \ - mmap_unlock(); \ -} + set_helper_retaddr(retaddr); =20 + split =3D max_for_page(addr, 0, mem_max); + if (likely(split =3D=3D mem_max)) { + /* The entire operation is within one page. 
*/ + host =3D tlb_vaddr_to_host(env, addr, MMU_DATA_LOAD, mmu_idx); + if (test_host_page(host)) { + mem_off =3D host_fn(vd, vg, host, 0, mem_max); + tcg_debug_assert(mem_off =3D=3D mem_max); + set_helper_retaddr(0); + return; + } + } + + /* Skip to the first true predicate. */ + reg_off =3D find_next_active(vg, 0, reg_max, esz); + if (unlikely(reg_off =3D=3D reg_max)) { + /* The entire predicate was false; no load occurs. */ + set_helper_retaddr(0); + memset(vd, 0, reg_max); + return; + } + mem_off =3D reg_off >> diffsz; + +#ifdef CONFIG_USER_ONLY + /* The page(s) containing this first element at ADDR+MEM_OFF must + * be valid. Considering that this first element may be misaligned + * and cross a page boundary itself, take the rest of the page from + * the last byte of the element. + */ + split =3D max_for_page(addr, mem_off + (1 << msz) - 1, mem_max); + mem_off =3D host_fn(vd, vg, g2h(addr), mem_off, split); + + /* After any fault, zero any leading predicated false elts. */ + swap_memzero(vd, reg_off); + reg_off =3D mem_off << diffsz; #else + /* Perform one normal read, which will fault or not. + * But it is likely to bring the page into the tlb. + */ + tlb_fn(env, vd, reg_off, addr + mem_off, mmu_idx, retaddr); =20 -/* TODO: System mode is not yet supported. - * This would probably use tlb_vaddr_to_host. - */ -#define DO_LDFF1(PART, FN, TYPEE, TYPEM, H) \ -void HELPER(sve_ldff1##PART)(CPUARMState *env, void *vg, \ - target_ulong addr, uint32_t desc) \ -{ \ - g_assert_not_reached(); \ -} - -#define DO_LDNF1(PART) \ -void HELPER(sve_ldnf1##PART)(CPUARMState *env, void *vg, \ - target_ulong addr, uint32_t desc) \ -{ \ - g_assert_not_reached(); \ -} + /* After any fault, zero any leading predicated false elts. */ + swap_memzero(vd, reg_off); + mem_off +=3D 1 << msz; + reg_off +=3D 1 << esz; =20 + /* Try again to read the balance of the page. */ + split =3D max_for_page(addr, mem_off - 1, mem_max); + if (split >=3D (1 << msz)) { + host =3D tlb_vaddr_to_host(env, addr + mem_off, MMU_DATA_LOAD, mmu= _idx); + if (host) { + mem_off =3D host_fn(vd, vg, host - mem_off, mem_off, split); + reg_off =3D mem_off << diffsz; + } + } #endif =20 -DO_LDFF1(bb_r, cpu_ldub_data_ra, uint8_t, uint8_t, H1) -DO_LDFF1(bhu_r, cpu_ldub_data_ra, uint16_t, uint8_t, H1_2) -DO_LDFF1(bhs_r, cpu_ldsb_data_ra, uint16_t, int8_t, H1_2) -DO_LDFF1(bsu_r, cpu_ldub_data_ra, uint32_t, uint8_t, H1_4) -DO_LDFF1(bss_r, cpu_ldsb_data_ra, uint32_t, int8_t, H1_4) -DO_LDFF1(bdu_r, cpu_ldub_data_ra, uint64_t, uint8_t, ) -DO_LDFF1(bds_r, cpu_ldsb_data_ra, uint64_t, int8_t, ) + set_helper_retaddr(0); + record_fault(env, reg_off, reg_max); +} =20 -DO_LDFF1(hh_r, cpu_lduw_data_ra, uint16_t, uint16_t, H1_2) -DO_LDFF1(hsu_r, cpu_lduw_data_ra, uint32_t, uint16_t, H1_4) -DO_LDFF1(hss_r, cpu_ldsw_data_ra, uint32_t, int8_t, H1_4) -DO_LDFF1(hdu_r, cpu_lduw_data_ra, uint64_t, uint16_t, ) -DO_LDFF1(hds_r, cpu_ldsw_data_ra, uint64_t, int16_t, ) +/* + * Common helper for all contiguous no-fault loads. 
+ */ +static void sve_ldnf1_r(CPUARMState *env, void *vg, const target_ulong add= r, + uint32_t desc, const int esz, const int msz, + sve_ld1_host_fn *host_fn) +{ + void *vd =3D &env->vfp.zregs[simd_data(desc)]; + const int diffsz =3D esz - msz; + const intptr_t reg_max =3D simd_oprsz(desc); + const intptr_t mem_max =3D reg_max >> diffsz; + intptr_t split, reg_off, mem_off; + void *host; =20 -DO_LDFF1(ss_r, cpu_ldl_data_ra, uint32_t, uint32_t, H1_4) -DO_LDFF1(sdu_r, cpu_ldl_data_ra, uint64_t, uint32_t, ) -DO_LDFF1(sds_r, cpu_ldl_data_ra, uint64_t, int32_t, ) +#ifdef CONFIG_USER_ONLY + /* Do not set helper_retaddr as there should be no fault. */ + host =3D g2h(addr); + if (likely(page_check_range(addr, mem_max, PAGE_READ) =3D=3D 0)) { + /* The entire operation is valid. */ + host_fn(vd, vg, host, 0, mem_max); + return; + } +#else + const int mmu_idx =3D extract32(desc, SIMD_DATA_SHIFT, 4); + /* Unless we can load the entire vector from the same page, + * we need to search for the first active element. + */ + split =3D max_for_page(addr, 0, mem_max); + if (likely(split =3D=3D mem_max)) { + host =3D tlb_vaddr_to_host(env, addr, MMU_DATA_LOAD, mmu_idx); + if (host) { + host_fn(vd, vg, host, 0, mem_max); + return; + } + } +#endif =20 -DO_LDFF1(dd_r, cpu_ldq_data_ra, uint64_t, uint64_t, ) + /* There will be no fault, so we may modify in advance. */ + memset(vd, 0, reg_max); =20 -#undef DO_LDFF1 + /* Skip to the first true predicate. */ + reg_off =3D find_next_active(vg, 0, reg_max, esz); + if (unlikely(reg_off =3D=3D reg_max)) { + /* The entire predicate was false; no load occurs. */ + return; + } + mem_off =3D reg_off >> diffsz; =20 -DO_LDNF1(bb_r) -DO_LDNF1(bhu_r) -DO_LDNF1(bhs_r) -DO_LDNF1(bsu_r) -DO_LDNF1(bss_r) -DO_LDNF1(bdu_r) -DO_LDNF1(bds_r) +#ifdef CONFIG_USER_ONLY + if (page_check_range(addr + mem_off, 1 << msz, PAGE_READ) =3D=3D 0) { + /* At least one load is valid; take the rest of the page. */ + split =3D max_for_page(addr, mem_off + (1 << msz) - 1, mem_max); + mem_off =3D host_fn(vd, vg, host, mem_off, split); + reg_off =3D mem_off << diffsz; + } +#else + /* If the address is not in the TLB, we have no way to bring the + * entry into the TLB without also risking a fault. Note that + * the corollary is that we never load from an address not in RAM. + * ??? This last may be out of spec. + */ + host =3D tlb_vaddr_to_host(env, addr + mem_off, MMU_DATA_LOAD, mmu_idx= ); + split =3D max_for_page(addr, mem_off, mem_max); + if (host && split >=3D (1 << msz)) { + mem_off =3D host_fn(vd, vg, host - mem_off, mem_off, split); + reg_off =3D mem_off << diffsz; + } +#endif =20 -DO_LDNF1(hh_r) -DO_LDNF1(hsu_r) -DO_LDNF1(hss_r) -DO_LDNF1(hdu_r) -DO_LDNF1(hds_r) + record_fault(env, reg_off, reg_max); +} =20 -DO_LDNF1(ss_r) -DO_LDNF1(sdu_r) -DO_LDNF1(sds_r) +#define DO_LDFF1_LDNF1_1(PART, ESZ) \ +void HELPER(sve_ldff1##PART##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + sve_ldff1_r(env, vg, addr, desc, GETPC(), ESZ, 0, \ + sve_ld1##PART##_host, sve_ld1##PART##_tlb); \ +} \ +void HELPER(sve_ldnf1##PART##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + sve_ldnf1_r(env, vg, addr, desc, ESZ, 0, sve_ld1##PART##_host); \ +} =20 -DO_LDNF1(dd_r) +/* TODO: Propagate the endian check back to the translator. 
*/ +#define DO_LDFF1_LDNF1_2(PART, ESZ, MSZ) \ +void HELPER(sve_ldff1##PART##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + if (arm_cpu_data_is_big_endian(env)) { \ + sve_ldff1_r(env, vg, addr, desc, GETPC(), ESZ, MSZ, \ + sve_ld1##PART##_be_host, sve_ld1##PART##_be_tlb); \ + } else { \ + sve_ldff1_r(env, vg, addr, desc, GETPC(), ESZ, MSZ, \ + sve_ld1##PART##_le_host, sve_ld1##PART##_le_tlb); \ + } \ +} \ +void HELPER(sve_ldnf1##PART##_r)(CPUARMState *env, void *vg, \ + target_ulong addr, uint32_t desc) \ +{ \ + if (arm_cpu_data_is_big_endian(env)) { \ + sve_ldnf1_r(env, vg, addr, desc, ESZ, MSZ, \ + sve_ld1##PART##_be_host); \ + } else { \ + sve_ldnf1_r(env, vg, addr, desc, ESZ, MSZ, \ + sve_ld1##PART##_le_host); \ + } \ +} =20 -#undef DO_LDNF1 +DO_LDFF1_LDNF1_1(bb, 0) +DO_LDFF1_LDNF1_1(bhu, 1) +DO_LDFF1_LDNF1_1(bhs, 1) +DO_LDFF1_LDNF1_1(bsu, 2) +DO_LDFF1_LDNF1_1(bss, 2) +DO_LDFF1_LDNF1_1(bdu, 3) +DO_LDFF1_LDNF1_1(bds, 3) + +DO_LDFF1_LDNF1_2(hh, 1, 1) +DO_LDFF1_LDNF1_2(hsu, 2, 1) +DO_LDFF1_LDNF1_2(hss, 2, 1) +DO_LDFF1_LDNF1_2(hdu, 3, 1) +DO_LDFF1_LDNF1_2(hds, 3, 1) + +DO_LDFF1_LDNF1_2(ss, 2, 2) +DO_LDFF1_LDNF1_2(sdu, 3, 2) +DO_LDFF1_LDNF1_2(sds, 3, 2) + +DO_LDFF1_LDNF1_2(dd, 3, 3) + +#undef DO_LDFF1_LDNF1_1 +#undef DO_LDFF1_LDNF1_2 =20 /* * Store contiguous data, protected by a governing predicate. --=20 2.17.1
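
[Editor's note, not part of the patch: for readers unfamiliar with the page-split trick used throughout the new helpers, the snippet below is a minimal, self-contained C sketch of the arithmetic that max_for_page() performs to find how many bytes of a contiguous access stay within the page containing base+off. All names here are hypothetical, a 4 KiB page size is assumed, and this is an illustration only, not the patch's actual code.]

#include <stdint.h>
#include <stdio.h>

#define DEMO_PAGE_BITS 12                             /* assumed 4 KiB pages */
#define DEMO_PAGE_MASK (~((uintptr_t)0) << DEMO_PAGE_BITS)

/* Return the largest offset <= max such that all bytes in [off, result)
 * of the access starting at base stay within one page.
 */
static intptr_t max_for_page_demo(uintptr_t base, intptr_t off, intptr_t max)
{
    uintptr_t addr = base + off;
    /* Bytes remaining in the current page: -(addr | PAGE_MASK). */
    intptr_t in_page = -(intptr_t)(addr | DEMO_PAGE_MASK);
    intptr_t remaining = max - off;
    return (in_page < remaining ? in_page : remaining) + off;
}

int main(void)
{
    /* A 64-byte vector access starting 16 bytes before a page boundary:
     * the first call stops at the boundary, the second covers the rest.
     */
    uintptr_t base = 0x1000 - 16;
    intptr_t first = max_for_page_demo(base, 0, 64);      /* -> 16 */
    intptr_t second = max_for_page_demo(base, first, 64); /* -> 64 */
    printf("%ld %ld\n", (long)first, (long)second);
    return 0;
}

[Compiled stand-alone, this splits the access into a 16-byte head and a 48-byte tail, mirroring the two-page case that sve_ld1_r() falls back to when the whole load does not fit in one page.]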