v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
v6: https://patchew.org/QEMU/20240424225705.929812-1-richard.henderson@linaro.org/

Changes for v7:
  - Generalize test_buffer_is_zero_next_accel and initialization (phil)

r~

Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  tests/bench: Add bufferiszero-bench

 include/qemu/cutils.h            |  32 ++-
 tests/bench/bufferiszero-bench.c |  47 ++++
 util/bufferiszero.c              | 465 ++++++++++++++++---------------
 tests/bench/meson.build          |   1 +
 4 files changed, 324 insertions(+), 221 deletions(-)
 create mode 100644 tests/bench/bufferiszero-bench.c

-- 
2.34.1

Hi,

This new version removes the translate_fn() callback from patch 1
because it wasn't removing the sign-extension for pentry as we thought
it would. A more detailed explanation is given in the commit msg of
patch 1.

We're now retrieving the 'lowaddr' value from load_elf_ram_sym() and
using it when we're running a 32-bit CPU. This worked with a 32-bit
'virt' machine booting with the -kernel option.

If this approach doesn't work for the Xvisor use case, IMO we should
just filter the kernel_load_addr bits directly, as we were doing a
handful of versions ago.

Patches are based on current riscv-to-apply.next.

Changes from v9:
- patch 1:
  - removed the translate_fn() callback
  - return 'kernel_low' when running a 32-bit CPU
- v9 link: https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg04509.html

Daniel Henrique Barboza (3):
  hw/riscv: handle 32 bit CPUs kernel_addr in riscv_load_kernel()
  hw/riscv/boot.c: consolidate all kernel init in riscv_load_kernel()
  hw/riscv/boot.c: make riscv_load_initrd() static

 hw/riscv/boot.c            | 96 +++++++++++++++++++++++---------------
 hw/riscv/microchip_pfsoc.c | 12 +----
 hw/riscv/opentitan.c       |  4 +-
 hw/riscv/sifive_e.c        |  4 +-
 hw/riscv/sifive_u.c        | 12 +----
 hw/riscv/spike.c           | 14 ++----
 hw/riscv/virt.c            | 12 +----
 include/hw/riscv/boot.h    |  3 +-
 8 files changed, 76 insertions(+), 81 deletions(-)

-- 
2.39.1
diff view generated by jsdifflib
Deleted patch
From: Alexander Monakov <amonakov@ispras.ru>

The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch
standpoint.

Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.

I never observed PTEST variants to be faster on real hardware.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>
---
 util/bufferiszero.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@ buffer_zero_sse2(const void *buf, size_t len)
 }
 
 #ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("sse4")))
-buffer_zero_sse4(const void *buf, size_t len)
-{
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        __builtin_prefetch(p);
-        if (unlikely(!_mm_testz_si128(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_testz_si128(t, t);
-}
-
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
@@ -XXX,XX +XXX,XX @@ select_accel_cpuinfo(unsigned info)
 #endif
 #ifdef CONFIG_AVX2_OPT
     { CPUINFO_AVX2, 128, buffer_zero_avx2 },
-    { CPUINFO_SSE4, 64, buffer_zero_sse4 },
 #endif
     { CPUINFO_SSE2, 64, buffer_zero_sse2 },
     { CPUINFO_ALWAYS, 0, buffer_zero_int },
-- 
2.34.1
From: Alexander Monakov <amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@ buffer_zero_int(const void *buf, size_t len)
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64. */
@@ -XXX,XX +XXX,XX @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes. */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256. */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 /*
  * Make sure that these variables are appropriately initialized when
  * SSE2 is enabled on the compiler command-line, but the compiler is
  * too old to support CONFIG_AVX2_OPT.
  */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 # define INIT_USED 0
 # define INIT_LENGTH 0
 # define INIT_ACCEL buffer_zero_int
@@ -XXX,XX +XXX,XX @@ select_accel_cpuinfo(unsigned info)
     unsigned len;
     bool (*fn)(const void *, size_t);
 } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-    { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
     { CPUINFO_AVX2, 128, buffer_zero_avx2 },
 #endif
@@ -XXX,XX +XXX,XX @@ select_accel_cpuinfo(unsigned info)
     return 0;
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
-- 
2.34.1
diff view generated by jsdifflib
1
load_elf_ram_sym() will sign-extend 32 bit addresses. If a 32 bit QEMU
guest happens to be running in a hypervisor that is using 64 bits to
encode its addresses, kernel_entry can be padded with '1's and create
problems [1].

Using a translate_fn() callback in load_elf_ram_sym() to filter the
padding from the address doesn't work. A more detailed explanation can
be found in [2]. The short version is that glue(load_elf, SZ), from
include/hw/elf_ops.h, will calculate 'pentry' (mapped into the
'kernel_load_base' var in riscv_load_kernel()) before using
translate_fn(), and will not recalculate it after executing it. This
means that the callback does not prevent the padding from appearing in
kernel_load_base.

Let's instead use a kernel_low var to capture the 'lowaddr' value from
load_elf_ram_sym(), and return it when we're dealing with 32 bit CPUs.

[1] https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg02281.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg00099.html

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/boot.c            | 15 +++++++++++----
 hw/riscv/microchip_pfsoc.c |  3 ++-
 hw/riscv/opentitan.c       |  3 ++-
 hw/riscv/sifive_e.c        |  3 ++-
 hw/riscv/sifive_u.c        |  3 ++-
 hw/riscv/spike.c           |  3 ++-
 hw/riscv/virt.c            |  3 ++-
 include/hw/riscv/boot.h    |  1 +
 8 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_firmware(const char *firmware_filename,
 }
 
 target_ulong riscv_load_kernel(MachineState *machine,
+                               RISCVHartArrayState *harts,
                                target_ulong kernel_start_addr,
                                symbol_fn_t sym_cb)
 {
     const char *kernel_filename = machine->kernel_filename;
-    uint64_t kernel_load_base, kernel_entry;
+    uint64_t kernel_load_base, kernel_entry, kernel_low;
 
     g_assert(kernel_filename != NULL);
 
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_kernel(MachineState *machine,
      * the (expected) load address load address. This allows kernels to have
      * separate SBI and ELF entry points (used by FreeBSD, for example).
      */
-    if (load_elf_ram_sym(kernel_filename, NULL, NULL, NULL,
-                         NULL, &kernel_load_base, NULL, NULL, 0,
+    if (load_elf_ram_sym(kernel_filename, NULL, NULL, NULL, NULL,
+                         &kernel_load_base, &kernel_low, NULL, 0,
                          EM_RISCV, 1, 0, NULL, true, sym_cb) > 0) {
-        return kernel_load_base;
+        kernel_entry = kernel_load_base;
+
+        if (riscv_is_32bit(harts)) {
+            kernel_entry = kernel_low;
+        }
+
+        return kernel_entry;
     }
 
     if (load_uimage_as(kernel_filename, &kernel_entry, NULL, NULL,
diff --git a/hw/riscv/microchip_pfsoc.c b/hw/riscv/microchip_pfsoc.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/microchip_pfsoc.c
+++ b/hw/riscv/microchip_pfsoc.c
@@ -XXX,XX +XXX,XX @@ static void microchip_icicle_kit_machine_init(MachineState *machine)
         kernel_start_addr = riscv_calc_kernel_start_addr(&s->soc.u_cpus,
                                                          firmware_end_addr);
 
-        kernel_entry = riscv_load_kernel(machine, kernel_start_addr, NULL);
+        kernel_entry = riscv_load_kernel(machine, &s->soc.u_cpus,
+                                         kernel_start_addr, NULL);
 
         if (machine->initrd_filename) {
             riscv_load_initrd(machine, kernel_entry);
diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/opentitan.c
+++ b/hw/riscv/opentitan.c
@@ -XXX,XX +XXX,XX @@ static void opentitan_board_init(MachineState *machine)
     }
 
     if (machine->kernel_filename) {
-        riscv_load_kernel(machine, memmap[IBEX_DEV_RAM].base, NULL);
+        riscv_load_kernel(machine, &s->soc.cpus,
+                          memmap[IBEX_DEV_RAM].base, NULL);
     }
 }
diff --git a/hw/riscv/sifive_e.c b/hw/riscv/sifive_e.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/sifive_e.c
+++ b/hw/riscv/sifive_e.c
@@ -XXX,XX +XXX,XX @@ static void sifive_e_machine_init(MachineState *machine)
                           memmap[SIFIVE_E_DEV_MROM].base, &address_space_memory);
 
     if (machine->kernel_filename) {
-        riscv_load_kernel(machine, memmap[SIFIVE_E_DEV_DTIM].base, NULL);
+        riscv_load_kernel(machine, &s->soc.cpus,
+                          memmap[SIFIVE_E_DEV_DTIM].base, NULL);
     }
 }
diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/sifive_u.c
+++ b/hw/riscv/sifive_u.c
@@ -XXX,XX +XXX,XX @@ static void sifive_u_machine_init(MachineState *machine)
         kernel_start_addr = riscv_calc_kernel_start_addr(&s->soc.u_cpus,
                                                          firmware_end_addr);
 
-        kernel_entry = riscv_load_kernel(machine, kernel_start_addr, NULL);
+        kernel_entry = riscv_load_kernel(machine, &s->soc.u_cpus,
+                                         kernel_start_addr, NULL);
 
         if (machine->initrd_filename) {
             riscv_load_initrd(machine, kernel_entry);
diff --git a/hw/riscv/spike.c b/hw/riscv/spike.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/spike.c
+++ b/hw/riscv/spike.c
@@ -XXX,XX +XXX,XX @@ static void spike_board_init(MachineState *machine)
         kernel_start_addr = riscv_calc_kernel_start_addr(&s->soc[0],
                                                          firmware_end_addr);
 
-        kernel_entry = riscv_load_kernel(machine, kernel_start_addr,
+        kernel_entry = riscv_load_kernel(machine, &s->soc[0],
+                                         kernel_start_addr,
                                          htif_symbol_callback);
 
         if (machine->initrd_filename) {
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -XXX,XX +XXX,XX @@ static void virt_machine_done(Notifier *notifier, void *data)
         kernel_start_addr = riscv_calc_kernel_start_addr(&s->soc[0],
                                                          firmware_end_addr);
 
-        kernel_entry = riscv_load_kernel(machine, kernel_start_addr, NULL);
+        kernel_entry = riscv_load_kernel(machine, &s->soc[0],
+                                         kernel_start_addr, NULL);
 
         if (machine->initrd_filename) {
             riscv_load_initrd(machine, kernel_entry);
diff --git a/include/hw/riscv/boot.h b/include/hw/riscv/boot.h
index XXXXXXX..XXXXXXX 100644
--- a/include/hw/riscv/boot.h
+++ b/include/hw/riscv/boot.h
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_firmware(const char *firmware_filename,
                                  hwaddr firmware_load_addr,
                                  symbol_fn_t sym_cb);
 target_ulong riscv_load_kernel(MachineState *machine,
+                               RISCVHartArrayState *harts,
                                target_ulong firmware_end_addr,
                                symbol_fn_t sym_cb);
 void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry);
-- 
2.39.1
Deleted patch
From: Alexander Monakov <amonakov@ispras.ru>

Test for length >= 256 inline, where it is often a constant.
Before calling into the accelerated routine, sample three bytes
from the buffer, which handles most non-zero buffers.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Message-Id: <20240206204809.9859-3-amonakov@ispras.ru>
[rth: Use __builtin_constant_p; move the indirect call out of line.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/qemu/cutils.h | 32 ++++++++++++++++-
 util/bufferiszero.c   | 84 +++++++++++++++++--------------------------
 2 files changed, 63 insertions(+), 53 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index XXXXXXX..XXXXXXX 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -XXX,XX +XXX,XX @@ char *freq_to_str(uint64_t freq_hz);
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool buffer_is_zero(const void *buf, size_t len);
+/*
+ * Check if a buffer is all zeroes.
+ */
+
+bool buffer_is_zero_ool(const void *vbuf, size_t len);
+bool buffer_is_zero_ge256(const void *vbuf, size_t len);
 bool test_buffer_is_zero_next_accel(void);
 
+static inline bool buffer_is_zero_sample3(const char *buf, size_t len)
+{
+    /*
+     * For any reasonably sized buffer, these three samples come from
+     * three different cachelines. In qemu-img usage, we find that
+     * each byte eliminates more than half of all buffer testing.
+     * It is therefore critical to performance that the byte tests
+     * short-circuit, so that we do not pull in additional cache lines.
+     * Do not "optimize" this to !(a | b | c).
+     */
+    return !buf[0] && !buf[len - 1] && !buf[len / 2];
+}
+
+#ifdef __OPTIMIZE__
+static inline bool buffer_is_zero(const void *buf, size_t len)
+{
+    return (__builtin_constant_p(len) && len >= 256
+            ? buffer_is_zero_sample3(buf, len) &&
+              buffer_is_zero_ge256(buf, len)
+            : buffer_is_zero_ool(buf, len));
+}
+#else
+#define buffer_is_zero buffer_is_zero_ool
+#endif
+
 /*
  * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
  * Input is limited to 14-bit numbers
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool
-buffer_zero_int(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t);
+
+static bool buffer_is_zero_integer(const void *buf, size_t len)
 {
     if (unlikely(len < 8)) {
         /* For a very small buffer, simply accumulate all the bytes. */
@@ -XXX,XX +XXX,XX @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-/*
- * Make sure that these variables are appropriately initialized when
- * SSE2 is enabled on the compiler command-line, but the compiler is
- * too old to support CONFIG_AVX2_OPT.
- */
-#if defined(CONFIG_AVX2_OPT)
-# define INIT_USED 0
-# define INIT_LENGTH 0
-# define INIT_ACCEL buffer_zero_int
-#else
-# ifndef __SSE2__
-#  error "ISA selection confusion"
-# endif
-# define INIT_USED CPUINFO_SSE2
-# define INIT_LENGTH 64
-# define INIT_ACCEL buffer_zero_sse2
-#endif
-
-static unsigned used_accel = INIT_USED;
-static unsigned length_to_accel = INIT_LENGTH;
-static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
-
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2, 128, buffer_zero_avx2 },
+        { CPUINFO_AVX2, buffer_zero_avx2 },
 #endif
-        { CPUINFO_SSE2, 64, buffer_zero_sse2 },
-        { CPUINFO_ALWAYS, 0, buffer_zero_int },
+        { CPUINFO_SSE2, buffer_zero_sse2 },
+        { CPUINFO_ALWAYS, buffer_is_zero_integer },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
         if (info & all[i].bit) {
-            length_to_accel = all[i].len;
-            buffer_accel = all[i].fn;
+            buffer_is_zero_accel = all[i].fn;
             return all[i].bit;
         }
     }
     return 0;
 }
 
-#if defined(CONFIG_AVX2_OPT)
+static unsigned used_accel;
+
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
 }
-#endif /* CONFIG_AVX2_OPT */
+
+#define INIT_ACCEL NULL
 
 bool test_buffer_is_zero_next_accel(void)
 {
@@ -XXX,XX +XXX,XX @@ bool test_buffer_is_zero_next_accel(void)
     used_accel |= used;
     return used;
 }
-
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    if (likely(len >= length_to_accel)) {
-        return buffer_accel(buf, len);
-    }
-    return buffer_zero_int(buf, len);
-}
-
 #else
-#define select_accel_fn buffer_zero_int
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
 }
+
+#define INIT_ACCEL buffer_is_zero_integer
 #endif
 
-/*
- * Checks if a buffer is all zeroes
- */
-bool buffer_is_zero(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+
+bool buffer_is_zero_ool(const void *buf, size_t len)
 {
     if (unlikely(len == 0)) {
         return true;
     }
+    if (!buffer_is_zero_sample3(buf, len)) {
+        return false;
+    }
+    /* All bytes are covered for any len <= 3. */
+    if (unlikely(len <= 3)) {
+        return true;
+    }
 
-    /* Fetch the beginning of the buffer while we select the accelerator. */
-    __builtin_prefetch(buf);
-
-    /* Use an optimized zero check if possible. Note that this also
-       includes a check for an unrolled loop over 64-bit integers. */
-    return select_accel_fn(buf, len);
+    if (likely(len >= 256)) {
+        return buffer_is_zero_accel(buf, len);
+    }
+    return buffer_is_zero_integer(buf, len);
+}
+
+bool buffer_is_zero_ge256(const void *buf, size_t len)
+{
+    return buffer_is_zero_accel(buf, len);
 }
-- 
2.34.1
Deleted patch
From: Alexander Monakov <amonakov@ispras.ru>

Use of prefetching in bufferiszero.c is quite questionable:

- prefetches are issued just a few CPU cycles before the corresponding
  line would be hit by demand loads;

- they are done for simple access patterns, i.e. where hardware
  prefetchers can perform better;

- they compete for load ports in loops that should be limited by load
  port throughput rather than ALU throughput.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>
---
 util/bufferiszero.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8);
             if (t) {
                 return false;
             }
@@ -XXX,XX +XXX,XX @@ buffer_zero_sse2(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64. */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
         t = _mm_cmpeq_epi8(t, zero);
         if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
             return false;
@@ -XXX,XX +XXX,XX @@ buffer_zero_avx2(const void *buf, size_t len)
 
     /* Loop over 32-byte aligned blocks of 128. */
     while (p <= e) {
-        __builtin_prefetch(p);
         if (unlikely(!_mm256_testz_si256(t, t))) {
             return false;
         }
-- 
2.34.1
Deleted patch
From: Alexander Monakov <amonakov@ispras.ru>

Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
---
 util/bufferiszero.c | 111 +++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 38 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64. */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations. */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256. */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
+    /* Unaligned loads at head/tail. */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries. */
+    const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+    __m128i zero = { 0 };
 
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+        p += 8;
+    } while (p < e - 7);
 
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes. */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
+    /* Unaligned loads at head/tail. */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries. */
+    const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+    const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
+    __m256i zero = { 0 };
 
-    /* Loop over 32-byte aligned blocks of 128. */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks. */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here. */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    } ;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    /* Finish the last block of 128 unaligned. */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
-
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 #endif /* CONFIG_AVX2_OPT */
 
-- 
2.34.1
The microchip_icicle_kit, sifive_u, spike and virt boards are now doing
the same steps when '-kernel' is used:

- execute load_kernel()
- load initrd()
- write kernel_cmdline

Let's fold everything inside riscv_load_kernel() to avoid code
repetition. To not change the behavior of boards that aren't calling
riscv_load_initrd(), add a 'load_initrd' flag to riscv_load_kernel() and
allow these boards to opt out from initrd loading.

Cc: Palmer Dabbelt <palmer@dabbelt.com>
Reviewed-by: Bin Meng <bmeng@tinylab.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
 hw/riscv/boot.c            | 21 ++++++++++++++++++---
 hw/riscv/microchip_pfsoc.c | 11 +----------
 hw/riscv/opentitan.c       |  3 ++-
 hw/riscv/sifive_e.c        |  3 ++-
 hw/riscv/sifive_u.c        | 11 +----------
 hw/riscv/spike.c           | 11 +----------
 hw/riscv/virt.c            | 11 +----------
 include/hw/riscv/boot.h    |  1 +
 8 files changed, 27 insertions(+), 45 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_firmware(const char *firmware_filename,
 target_ulong riscv_load_kernel(MachineState *machine,
                                RISCVHartArrayState *harts,
                                target_ulong kernel_start_addr,
+                               bool load_initrd,
                                symbol_fn_t sym_cb)
 {
     const char *kernel_filename = machine->kernel_filename;
     uint64_t kernel_load_base, kernel_entry, kernel_low;
+    void *fdt = machine->fdt;
 
     g_assert(kernel_filename != NULL);
 
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_kernel(MachineState *machine,
             kernel_entry = kernel_low;
         }
 
-        return kernel_entry;
+        goto out;
     }
 
     if (load_uimage_as(kernel_filename, &kernel_entry, NULL, NULL,
                        NULL, NULL, NULL) > 0) {
-        return kernel_entry;
+        goto out;
     }
 
     if (load_image_targphys_as(kernel_filename, kernel_start_addr,
                                current_machine->ram_size, NULL) > 0) {
-        return kernel_start_addr;
+        kernel_entry = kernel_start_addr;
+        goto out;
     }
 
     error_report("could not load kernel '%s'", kernel_filename);
     exit(1);
+
+out:
+    if (load_initrd && machine->initrd_filename) {
+        riscv_load_initrd(machine, kernel_entry);
+    }
+
+    if (fdt && machine->kernel_cmdline && *machine->kernel_cmdline) {
+        qemu_fdt_setprop_string(fdt, "/chosen", "bootargs",
+                                machine->kernel_cmdline);
+    }
+
+    return kernel_entry;

Split less-than and greater-than 256 cases.
Use unaligned accesses for head and tail.
Avoid using out-of-bounds pointers in loop boundary conditions.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 85 +++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@
 
 static bool (*buffer_is_zero_accel)(const void *, size_t);
 
-static bool buffer_is_zero_integer(const void *buf, size_t len)
+static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
-    if (unlikely(len < 8)) {
-        /* For a very small buffer, simply accumulate all the bytes. */
-        const unsigned char *p = buf;
-        const unsigned char *e = buf + len;
-        unsigned char t = 0;
+    uint64_t t;
+    const uint64_t *p, *e;
 
-        do {
-            t |= *p++;
-        } while (p < e);
-
-        return t == 0;
-    } else {
-        /* Otherwise, use the unaligned memory access functions to
-           handle the beginning and end of the buffer, with a couple
-           of loops handling the middle aligned section. */
-        uint64_t t = ldq_he_p(buf);
-        const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
-        const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
-
-        for (; p + 8 <= e; p += 8) {
-            if (t) {
-                return false;
-            }
-            t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
-        }
-        while (p < e) {
-            t |= *p++;
-        }
-        t |= ldq_he_p(buf + len - 8);
-
-        return t == 0;
+    /*
+     * Use unaligned memory access functions to handle
+     * the beginning and end of the buffer.
+     */
+    if (unlikely(len <= 8)) {
+        return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
     }
+
+    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
+    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
+
+    /* Read 0 to 31 aligned words from the middle. */
+    while (p < e) {
+        t |= *p++;
+    }
+    return t == 0;
+}
+
+static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
+{
+    /*
+     * Use unaligned memory access functions to handle
78
+ * the beginning and end of the buffer.
79
+ */
80
+ uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
81
+ const uint64_t *p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
82
+ const uint64_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
83
+
84
+ /* Collect a partial block at the tail end. */
85
+ t |= e[-7] | e[-6] | e[-5] | e[-4] | e[-3] | e[-2] | e[-1];
86
+
87
+ /*
88
+ * Loop over 64 byte blocks.
89
+ * With the head and tail removed, e - p >= 30,
90
+ * so the loop must iterate at least 3 times.
91
+ */
92
+ do {
93
+ if (t) {
94
+ return false;
95
+ }
96
+ t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
97
+ p += 8;
98
+ } while (p < e - 7);
99
+
100
+ return t == 0;
101
}
80
}
102
81
103
#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
82
void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry)
104
@@ -XXX,XX +XXX,XX @@ select_accel_cpuinfo(unsigned info)
83
diff --git a/hw/riscv/microchip_pfsoc.c b/hw/riscv/microchip_pfsoc.c
105
{ CPUINFO_AVX2, buffer_zero_avx2 },
84
index XXXXXXX..XXXXXXX 100644
106
#endif
85
--- a/hw/riscv/microchip_pfsoc.c
107
{ CPUINFO_SSE2, buffer_zero_sse2 },
86
+++ b/hw/riscv/microchip_pfsoc.c
108
- { CPUINFO_ALWAYS, buffer_is_zero_integer },
87
@@ -XXX,XX +XXX,XX @@ static void microchip_icicle_kit_machine_init(MachineState *machine)
109
+ { CPUINFO_ALWAYS, buffer_is_zero_int_ge256 },
88
firmware_end_addr);
110
};
89
111
90
kernel_entry = riscv_load_kernel(machine, &s->soc.u_cpus,
112
for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
91
- kernel_start_addr, NULL);
113
@@ -XXX,XX +XXX,XX @@ bool test_buffer_is_zero_next_accel(void)
92
-
114
return false;
93
- if (machine->initrd_filename) {
94
- riscv_load_initrd(machine, kernel_entry);
95
- }
96
-
97
- if (machine->kernel_cmdline && *machine->kernel_cmdline) {
98
- qemu_fdt_setprop_string(machine->fdt, "/chosen",
99
- "bootargs", machine->kernel_cmdline);
100
- }
101
+ kernel_start_addr, true, NULL);
102
103
/* Compute the fdt load address in dram */
104
fdt_load_addr = riscv_compute_fdt_addr(memmap[MICROCHIP_PFSOC_DRAM_LO].base,
105
diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c
106
index XXXXXXX..XXXXXXX 100644
107
--- a/hw/riscv/opentitan.c
108
+++ b/hw/riscv/opentitan.c
109
@@ -XXX,XX +XXX,XX @@ static void opentitan_board_init(MachineState *machine)
110
111
if (machine->kernel_filename) {
112
riscv_load_kernel(machine, &s->soc.cpus,
113
- memmap[IBEX_DEV_RAM].base, NULL);
114
+ memmap[IBEX_DEV_RAM].base,
115
+ false, NULL);
116
}
115
}
117
}
116
118
117
-#define INIT_ACCEL buffer_is_zero_integer
119
diff --git a/hw/riscv/sifive_e.c b/hw/riscv/sifive_e.c
118
+#define INIT_ACCEL buffer_is_zero_int_ge256
120
index XXXXXXX..XXXXXXX 100644
119
#endif
121
--- a/hw/riscv/sifive_e.c
120
122
+++ b/hw/riscv/sifive_e.c
121
static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
123
@@ -XXX,XX +XXX,XX @@ static void sifive_e_machine_init(MachineState *machine)
122
@@ -XXX,XX +XXX,XX @@ bool buffer_is_zero_ool(const void *buf, size_t len)
124
123
if (likely(len >= 256)) {
125
if (machine->kernel_filename) {
124
return buffer_is_zero_accel(buf, len);
126
riscv_load_kernel(machine, &s->soc.cpus,
125
}
127
- memmap[SIFIVE_E_DEV_DTIM].base, NULL);
126
- return buffer_is_zero_integer(buf, len);
128
+ memmap[SIFIVE_E_DEV_DTIM].base,
127
+ return buffer_is_zero_int_lt256(buf, len);
129
+ false, NULL);
130
}
128
}
131
}
129
132
130
bool buffer_is_zero_ge256(const void *buf, size_t len)
133
diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
134
index XXXXXXX..XXXXXXX 100644
135
--- a/hw/riscv/sifive_u.c
136
+++ b/hw/riscv/sifive_u.c
137
@@ -XXX,XX +XXX,XX @@ static void sifive_u_machine_init(MachineState *machine)
138
firmware_end_addr);
139
140
kernel_entry = riscv_load_kernel(machine, &s->soc.u_cpus,
141
- kernel_start_addr, NULL);
142
-
143
- if (machine->initrd_filename) {
144
- riscv_load_initrd(machine, kernel_entry);
145
- }
146
-
147
- if (machine->kernel_cmdline && *machine->kernel_cmdline) {
148
- qemu_fdt_setprop_string(machine->fdt, "/chosen", "bootargs",
149
- machine->kernel_cmdline);
150
- }
151
+ kernel_start_addr, true, NULL);
152
} else {
153
/*
154
* If dynamic firmware is used, it doesn't know where is the next mode
155
diff --git a/hw/riscv/spike.c b/hw/riscv/spike.c
156
index XXXXXXX..XXXXXXX 100644
157
--- a/hw/riscv/spike.c
158
+++ b/hw/riscv/spike.c
159
@@ -XXX,XX +XXX,XX @@ static void spike_board_init(MachineState *machine)
160
161
kernel_entry = riscv_load_kernel(machine, &s->soc[0],
162
kernel_start_addr,
163
- htif_symbol_callback);
164
-
165
- if (machine->initrd_filename) {
166
- riscv_load_initrd(machine, kernel_entry);
167
- }
168
-
169
- if (machine->kernel_cmdline && *machine->kernel_cmdline) {
170
- qemu_fdt_setprop_string(machine->fdt, "/chosen", "bootargs",
171
- machine->kernel_cmdline);
172
- }
173
+ true, htif_symbol_callback);
174
} else {
175
/*
176
* If dynamic firmware is used, it doesn't know where is the next mode
177
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
178
index XXXXXXX..XXXXXXX 100644
179
--- a/hw/riscv/virt.c
180
+++ b/hw/riscv/virt.c
181
@@ -XXX,XX +XXX,XX @@ static void virt_machine_done(Notifier *notifier, void *data)
182
firmware_end_addr);
183
184
kernel_entry = riscv_load_kernel(machine, &s->soc[0],
185
- kernel_start_addr, NULL);
186
-
187
- if (machine->initrd_filename) {
188
- riscv_load_initrd(machine, kernel_entry);
189
- }
190
-
191
- if (machine->kernel_cmdline && *machine->kernel_cmdline) {
192
- qemu_fdt_setprop_string(machine->fdt, "/chosen", "bootargs",
193
- machine->kernel_cmdline);
194
- }
195
+ kernel_start_addr, true, NULL);
196
} else {
197
/*
198
* If dynamic firmware is used, it doesn't know where is the next mode
199
diff --git a/include/hw/riscv/boot.h b/include/hw/riscv/boot.h
200
index XXXXXXX..XXXXXXX 100644
201
--- a/include/hw/riscv/boot.h
202
+++ b/include/hw/riscv/boot.h
203
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_firmware(const char *firmware_filename,
204
target_ulong riscv_load_kernel(MachineState *machine,
205
RISCVHartArrayState *harts,
206
target_ulong firmware_end_addr,
207
+ bool load_initrd,
208
symbol_fn_t sym_cb);
209
void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry);
210
uint64_t riscv_compute_fdt_addr(hwaddr dram_start, uint64_t dram_size,
131
--
211
--
132
2.34.1
212
2.39.1
133
134
diff view generated by jsdifflib
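The central trick in buffer_is_zero_int_ge256 above, covering an arbitrary-length buffer with two overlapping unaligned 8-byte loads plus aligned words in between, can be sketched in portable C. This is a standalone illustration, not QEMU code: `load8` stands in for `ldq_he_p`, and `is_zero_scalar` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for QEMU's ldq_he_p(): an unaligned 8-byte load. */
static uint64_t load8(const void *p)
{
    uint64_t v;
    memcpy(&v, p, 8);
    return v;
}

/*
 * Zero check for len >= 8: OR together an unaligned load at the head,
 * an unaligned load at the tail, and the aligned 8-byte words between
 * them.  The head/tail loads overlap the middle, which is harmless
 * because OR is idempotent; every byte of the buffer is covered.
 */
static bool is_zero_scalar(const void *buf, size_t len)
{
    const unsigned char *b = buf;
    uint64_t t = load8(b) | load8(b + len - 8);
    const uint64_t *p = (const uint64_t *)(((uintptr_t)b + 8) & ~(uintptr_t)7);
    const uint64_t *e = (const uint64_t *)(((uintptr_t)b + len - 1) & ~(uintptr_t)7);

    while (p < e) {
        t |= *p++;
    }
    return t == 0;
}
```

The overlap is what removes all the special-case loops of the old code: alignment never leaves a gap, because any bytes the aligned middle misses are already included in the head or tail load.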
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"

-static bool (*buffer_is_zero_accel)(const void *, size_t);
+typedef bool (*biz_accel_fn)(const void *, size_t);
+static biz_accel_fn buffer_is_zero_accel;

 static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
@@ -XXX,XX +XXX,XX @@ select_accel_cpuinfo(unsigned info)
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        bool (*fn)(const void *, size_t);
+        biz_accel_fn fn;
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2, buffer_zero_avx2 },
@@ -XXX,XX +XXX,XX @@ bool test_buffer_is_zero_next_accel(void)
 #define INIT_ACCEL buffer_is_zero_int_ge256
 #endif

-static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+static biz_accel_fn buffer_is_zero_accel = INIT_ACCEL;

 bool buffer_is_zero_ool(const void *buf, size_t len)
 {
--
2.34.1
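The typedef above is mechanical, but it is the piece the following patches build on: one named function-pointer type shared by every implementation and by the dispatch variable. A minimal standalone sketch of that shape (the names here are hypothetical, not the QEMU API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One name for the shared signature, mirroring biz_accel_fn. */
typedef bool (*zero_check_fn)(const void *, size_t);

/* A trivial byte-at-a-time implementation. */
static bool check_bytes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    unsigned char t = 0;
    for (size_t i = 0; i < len; i++) {
        t |= p[i];
    }
    return t == 0;
}

/*
 * Dispatch through a mutable pointer, as buffer_is_zero_accel does;
 * a startup probe could later repoint this at a faster variant
 * without changing any caller.
 */
static zero_check_fn active_impl = check_bytes;

bool is_zero(const void *buf, size_t len)
{
    return active_impl(buf, len);
}
```

With the typedef in place, tables of implementations and reassignments of the dispatch pointer all read the same way, which is exactly what the next patch exploits.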
Because the three alternatives are monotonic, we don't need
to keep a couple of bitmasks, just identify the strongest
alternative at startup.

Generalize test_buffer_is_zero_next_accel and init_accel
by always defining an accel_table array.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 81 ++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 46 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@
 #include "host/cpuinfo.h"

 typedef bool (*biz_accel_fn)(const void *, size_t);
-static biz_accel_fn buffer_is_zero_accel;

 static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
@@ -XXX,XX +XXX,XX @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */

-static unsigned __attribute__((noinline))
-select_accel_cpuinfo(unsigned info)
-{
-    /* Array is sorted in order of algorithm preference. */
-    static const struct {
-        unsigned bit;
-        biz_accel_fn fn;
-    } all[] = {
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_zero_sse2,
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2, buffer_zero_avx2 },
+    buffer_zero_avx2,
 #endif
-        { CPUINFO_SSE2, buffer_zero_sse2 },
-        { CPUINFO_ALWAYS, buffer_is_zero_int_ge256 },
-    };
+};

-    for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
-        if (info & all[i].bit) {
-            buffer_is_zero_accel = all[i].fn;
-            return all[i].bit;
-        }
+static unsigned best_accel(void)
+{
+    unsigned info = cpuinfo_init();
+
+#ifdef CONFIG_AVX2_OPT
+    if (info & CPUINFO_AVX2) {
+        return 2;
     }
-    return 0;
+#endif
+    return info & CPUINFO_SSE2 ? 1 : 0;
 }

-static unsigned used_accel;
-
-static void __attribute__((constructor)) init_accel(void)
-{
-    used_accel = select_accel_cpuinfo(cpuinfo_init());
-}
-
-#define INIT_ACCEL NULL
-
-bool test_buffer_is_zero_next_accel(void)
-{
-    /*
-     * Accumulate the accelerators that we've already tested, and
-     * remove them from the set to test this round.  We'll get back
-     * a zero from select_accel_cpuinfo when there are no more.
-     */
-    unsigned used = select_accel_cpuinfo(cpuinfo & ~used_accel);
-    used_accel |= used;
-    return used;
-}
 #else
-bool test_buffer_is_zero_next_accel(void)
-{
-    return false;
-}
-
-#define INIT_ACCEL buffer_is_zero_int_ge256
+#define best_accel() 0
+static biz_accel_fn const accel_table[1] = {
+    buffer_is_zero_int_ge256
+};
 #endif

-static biz_accel_fn buffer_is_zero_accel = INIT_ACCEL;
+static biz_accel_fn buffer_is_zero_accel;
+static unsigned accel_index;

 bool buffer_is_zero_ool(const void *buf, size_t len)
 {
@@ -XXX,XX +XXX,XX @@ bool buffer_is_zero_ge256(const void *buf, size_t len)
 {
     return buffer_is_zero_accel(buf, len);
 }
+
+bool test_buffer_is_zero_next_accel(void)
+{
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
+}
+
+static void __attribute__((constructor)) init_accel(void)
+{
+    accel_index = best_accel();
+    buffer_is_zero_accel = accel_table[accel_index];
+}
--
2.34.1
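The reorganized selection logic reduces to three parts: a table sorted weakest-first, an index chosen once at startup, and a "next accel" hook that simply decrements the index. A standalone sketch of that shape (hypothetical names; the real table of course holds buffer_is_zero implementations and the index comes from a cpuinfo probe):

```c
#include <assert.h>
#include <stdbool.h>

typedef int (*impl_fn)(void);

static int impl_basic(void) { return 10; }
static int impl_sse2(void)  { return 20; }
static int impl_avx2(void)  { return 30; }

/*
 * Sorted weakest-first, like accel_table; a best_accel()-style probe
 * would return the highest usable index.  Here we pretend the probe
 * selected index 2 (the AVX2-like entry).
 */
static impl_fn const table[] = { impl_basic, impl_sse2, impl_avx2 };
static unsigned impl_index = 2;
static impl_fn current_impl = impl_avx2;

/*
 * Mirror of test_buffer_is_zero_next_accel(): step down one table
 * entry per call, reporting false once the weakest entry is active.
 */
static bool next_impl(void)
{
    if (impl_index != 0) {
        current_impl = table[--impl_index];
        return true;
    }
    return false;
}
```

Because the alternatives are monotonic (each entry is usable whenever the one after it is), a single index replaces the old bitmask bookkeeping, and the unit tests can walk every weaker implementation just by decrementing.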
Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check.  Otherwise, model the loop after the x86 SSE2 function.

Use UMAXV for the vector reduction.  This is 3 cycles on cortex-a76 and
2 cycles on neoverse-n1.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 67 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index XXXXXXX..XXXXXXX 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -XXX,XX +XXX,XX @@ static unsigned best_accel(void)
     return info & CPUINFO_SSE2 ? 1 : 0;
 }

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+#include <arm_neon.h>
+
+/*
+ * Helper for preventing the compiler from reassociating
+ * chains of binary vector operations.
+ */
+#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
+
+static bool buffer_is_zero_simd(const void *buf, size_t len)
+{
+    uint32x4_t t0, t1, t2, t3;
+
+    /* Align head/tail to 16-byte boundaries.  */
+    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+
+    /* Unaligned loads at head/tail.  */
+    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
+
+    /* Collect a partial block at tail end.  */
+    t1 = e[-7] | e[-6];
+    t2 = e[-5] | e[-4];
+    t3 = e[-3] | e[-2];
+    t0 |= e[-1];
+    REASSOC_BARRIER(t0, t1);
+    REASSOC_BARRIER(t2, t3);
+    t0 |= t1;
+    t2 |= t3;
+    REASSOC_BARRIER(t0, t2);
+    t0 |= t2;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        /*
+         * Reduce via UMAXV.  Whatever the actual result,
+         * it will only be zero if all input bytes are zero.
+         */
+        if (unlikely(vmaxvq_u32(t0) != 0)) {
+            return false;
+        }
+
+        t0 = p[0] | p[1];
+        t1 = p[2] | p[3];
+        t2 = p[4] | p[5];
+        t3 = p[6] | p[7];
+        REASSOC_BARRIER(t0, t1);
+        REASSOC_BARRIER(t2, t3);
+        t0 |= t1;
+        t2 |= t3;
+        REASSOC_BARRIER(t0, t2);
+        t0 |= t2;
+        p += 8;
+    } while (p < e - 7);
+
+    return vmaxvq_u32(t0) == 0;
+}
+
+#define best_accel() 1
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_is_zero_simd,
+};
 #else
 #define best_accel() 0
 static biz_accel_fn const accel_table[1] = {
--
2.34.1

The only remaining caller is riscv_load_kernel_and_initrd() which
belongs to the same file.

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Bin Meng <bmeng@tinylab.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
---
 hw/riscv/boot.c         | 80 ++++++++++++++++++++---------------------
 include/hw/riscv/boot.h |  1 -
 2 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index XXXXXXX..XXXXXXX 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_firmware(const char *firmware_filename,
     exit(1);
 }

+static void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry)
+{
+    const char *filename = machine->initrd_filename;
+    uint64_t mem_size = machine->ram_size;
+    void *fdt = machine->fdt;
+    hwaddr start, end;
+    ssize_t size;
+
+    g_assert(filename != NULL);
+
+    /*
+     * We want to put the initrd far enough into RAM that when the
+     * kernel is uncompressed it will not clobber the initrd. However
+     * on boards without much RAM we must ensure that we still leave
+     * enough room for a decent sized initrd, and on boards with large
+     * amounts of RAM we must avoid the initrd being so far up in RAM
+     * that it is outside lowmem and inaccessible to the kernel.
+     * So for boards with less than 256MB of RAM we put the initrd
+     * halfway into RAM, and for boards with 256MB of RAM or more we put
+     * the initrd at 128MB.
+     */
+    start = kernel_entry + MIN(mem_size / 2, 128 * MiB);
+
+    size = load_ramdisk(filename, start, mem_size - start);
+    if (size == -1) {
+        size = load_image_targphys(filename, start, mem_size - start);
+        if (size == -1) {
+            error_report("could not load ramdisk '%s'", filename);
+            exit(1);
+        }
+    }
+
+    /* Some RISC-V machines (e.g. opentitan) don't have a fdt. */
+    if (fdt) {
+        end = start + size;
+        qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-start", start);
+        qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-end", end);
+    }
+}
+
 target_ulong riscv_load_kernel(MachineState *machine,
                                RISCVHartArrayState *harts,
                                target_ulong kernel_start_addr,
@@ -XXX,XX +XXX,XX @@ out:
     return kernel_entry;
 }

-void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry)
-{
-    const char *filename = machine->initrd_filename;
-    uint64_t mem_size = machine->ram_size;
-    void *fdt = machine->fdt;
-    hwaddr start, end;
-    ssize_t size;
-
-    g_assert(filename != NULL);
-
-    /*
-     * We want to put the initrd far enough into RAM that when the
-     * kernel is uncompressed it will not clobber the initrd. However
-     * on boards without much RAM we must ensure that we still leave
-     * enough room for a decent sized initrd, and on boards with large
-     * amounts of RAM we must avoid the initrd being so far up in RAM
-     * that it is outside lowmem and inaccessible to the kernel.
-     * So for boards with less than 256MB of RAM we put the initrd
-     * halfway into RAM, and for boards with 256MB of RAM or more we put
-     * the initrd at 128MB.
-     */
-    start = kernel_entry + MIN(mem_size / 2, 128 * MiB);
-
-    size = load_ramdisk(filename, start, mem_size - start);
-    if (size == -1) {
-        size = load_image_targphys(filename, start, mem_size - start);
-        if (size == -1) {
-            error_report("could not load ramdisk '%s'", filename);
-            exit(1);
-        }
-    }
-
-    /* Some RISC-V machines (e.g. opentitan) don't have a fdt. */
-    if (fdt) {
-        end = start + size;
-        qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-start", start);
-        qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-end", end);
-    }
-}
-
 /*
  * This function makes an assumption that the DRAM interval
  * 'dram_base' + 'dram_size' is contiguous.
diff --git a/include/hw/riscv/boot.h b/include/hw/riscv/boot.h
index XXXXXXX..XXXXXXX 100644
--- a/include/hw/riscv/boot.h
+++ b/include/hw/riscv/boot.h
@@ -XXX,XX +XXX,XX @@ target_ulong riscv_load_kernel(MachineState *machine,
                                target_ulong firmware_end_addr,
                                bool load_initrd,
                                symbol_fn_t sym_cb);
-void riscv_load_initrd(MachineState *machine, uint64_t kernel_entry);
 uint64_t riscv_compute_fdt_addr(hwaddr dram_start, uint64_t dram_size,
                                 MachineState *ms);
 void riscv_load_fdt(hwaddr fdt_addr, void *fdt);
--
2.39.1
Benchmark each acceleration function vs an aligned buffer of zeros.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tests/bench/bufferiszero-bench.c | 47 ++++++++++++++++++++++++++++++++
 tests/bench/meson.build          |  1 +
 2 files changed, 48 insertions(+)
 create mode 100644 tests/bench/bufferiszero-bench.c

diff --git a/tests/bench/bufferiszero-bench.c b/tests/bench/bufferiszero-bench.c
new file mode 100644
index XXXXXXX..XXXXXXX
--- /dev/null
+++ b/tests/bench/bufferiszero-bench.c
@@ -XXX,XX +XXX,XX @@
+/*
+ * QEMU buffer_is_zero speed benchmark
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.  See the COPYING file in the
+ * top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/units.h"
+
+static void test(const void *opaque)
+{
+    size_t max = 64 * KiB;
+    void *buf = g_malloc0(max);
+    int accel_index = 0;
+
+    do {
+        if (accel_index != 0) {
+            g_test_message("%s", "");  /* gnu_printf Werror for simple "" */
+        }
+        for (size_t len = 1 * KiB; len <= max; len *= 4) {
+            double total = 0.0;
+
+            g_test_timer_start();
+            do {
+                buffer_is_zero_ge256(buf, len);
+                total += len;
+            } while (g_test_timer_elapsed() < 0.5);
+
+            total /= MiB;
+            g_test_message("buffer_is_zero #%d: %2zuKB %8.0f MB/sec",
+                           accel_index, len / (size_t)KiB,
+                           total / g_test_timer_last());
+        }
+        accel_index++;
+    } while (test_buffer_is_zero_next_accel());
+
+    g_free(buf);
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_data_func("/cutils/bufferiszero/speed", NULL, test);
+    return g_test_run();
+}
diff --git a/tests/bench/meson.build b/tests/bench/meson.build
index XXXXXXX..XXXXXXX 100644
--- a/tests/bench/meson.build
+++ b/tests/bench/meson.build
@@ -XXX,XX +XXX,XX @@ benchs = {}

 if have_block
   benchs += {
+    'bufferiszero-bench': [],
     'benchmark-crypto-hash': [crypto],
     'benchmark-crypto-hmac': [crypto],
     'benchmark-crypto-cipher': [crypto],
--
2.34.1
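The accumulate-until-deadline pattern used by the bench (run the function repeatedly until the timer passes 0.5s, then divide bytes processed by elapsed time) can be reproduced without glib's g_test timers. A rough standalone sketch using clock(); the naive scan and all names here are illustrative stand-ins, not the QEMU benchmark itself:

```c
#include <stdlib.h>
#include <time.h>

/* Naive byte scan standing in for buffer_is_zero_ge256(). */
static int is_zero_naive(const unsigned char *buf, size_t len)
{
    unsigned char t = 0;
    for (size_t i = 0; i < len; i++) {
        t |= buf[i];
    }
    return t == 0;
}

static volatile int sink;  /* keep the call from being optimized away */

/* Run the scan until ~0.5s of CPU time has passed, then report MB/s. */
static double bench_mb_per_sec(size_t len)
{
    unsigned char *buf = calloc(len, 1);
    double total_bytes = 0;
    clock_t start = clock();

    do {
        sink = is_zero_naive(buf, len);
        total_bytes += len;
    } while ((double)(clock() - start) / CLOCKS_PER_SEC < 0.5);

    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(buf);
    return total_bytes / (1024.0 * 1024.0) / secs;
}
```

Measuring for a fixed time rather than a fixed iteration count is what makes the numbers comparable across implementations whose speeds differ by an order of magnitude, which is the situation this series creates on purpose.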