[PATCH] arm64: clear_user: align __arch_clear_user() to 64B for I-cache efficiency
Posted by Luke Yang 1 week, 3 days ago
On aarch64 kernels, a recent change (specifically the irqbypass patch
https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
shifted __arch_clear_user() so that its tight zeroing loop straddles an
I-cache line boundary. This causes a measurable read performance
regression when reading from /dev/zero.

Add `.p2align 6` (64-byte alignment) to guarantee the loop stays within a
single I-cache line, restoring the previous IPC and throughput.
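
Whether the entry point actually moved can be double-checked on a running
kernel with a small helper along these lines (illustrative sketch, not part
of the patch; it assumes kallsyms addresses are visible, e.g. run as root
with kernel.kptr_restrict=0):

// gcc -O2 -Wall -Wextra -o clear_user_align clear_user_align.c
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/kallsyms", "r");
    if (!f) { perror("fopen /proc/kallsyms"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        unsigned long long addr;
        char type, name[128];

        if (sscanf(line, "%llx %c %127s", &addr, &type, name) != 3)
            continue;
        if (strcmp(name, "__arch_clear_user") != 0)
            continue;

        /* Offset 0 means the entry point sits on a 64-byte line boundary. */
        printf("__arch_clear_user: 0x%llx (offset %llu in a 64-byte line)\n",
               addr, addr % 64);
        fclose(f);
        return 0;
    }

    fprintf(stderr, "__arch_clear_user not found in /proc/kallsyms\n");
    fclose(f);
    return 1;
}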

Tested on bare-metal aarch64 systems:

  Good kernel: pread_z100k ~ 6.9 s
  Bad kernel:  pread_z100k ~ 9.0 s
  With patch:  pread_z100k ~ 6.9 s

Reproducer:

// gcc -O2 -Wall -Wextra -o pread_z100k pread_z100k.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (100 * 1024)   /* bytes per pread() */
#define COUNT 1000000       /* number of reads timed */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) { perror("open /dev/zero"); return 1; }

    char *buf = malloc(SIZE);
    if (!buf) { perror("malloc"); close(fd); return 1; }

    double t1 = now_sec();
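    /* Each pread() from /dev/zero zeroes the user buffer, which on arm64
     * goes through __arch_clear_user(). */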
    for (int i = 0; i < COUNT; i++) {
        ssize_t r = pread(fd, buf, SIZE, 0);
        if (r != SIZE) { perror("pread"); break; }
    }
    double t2 = now_sec();

    printf("%.6f\n", t2 - t1);

    close(fd);
    free(buf);
    return 0;
}

Signed-off-by: Luke Yang <luyang@redhat.com>
Signed-off-by: Jirka Hladky <jhladky@redhat.com>
---
 arch/arm64/lib/clear_user.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/lib/clear_user.S b/arch/arm64/lib/clear_user.S
index de9a303b6..91eee4a7c 100644
--- a/arch/arm64/lib/clear_user.S
+++ b/arch/arm64/lib/clear_user.S
@@ -17,6 +17,11 @@
  * Alignment fixed up by hardware.
  */

+/*
+ * Ensure __arch_clear_user() always starts on an I-cache line boundary.
+ */
+    .p2align 6    // 2^6 = 64-byte alignment
+
 SYM_FUNC_START(__arch_clear_user)
     add    x2, x0, x1

-- 
2.51.1
Re: [PATCH] arm64: clear_user: align __arch_clear_user() to 64B for I-cache efficiency
Posted by Will Deacon 1 week ago
On Fri, Nov 21, 2025 at 12:04:55AM -0500, Luke Yang wrote:
> On aarch64 kernels, a recent change (specifically the irqbypass patch
> https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
> shifted __arch_clear_user() so that its tight zeroing loop straddles an
> I-cache line boundary. This causes a measurable read performance
> regression when reading from /dev/zero.
> 
> Add `.p2align 6` (64-byte alignment) to guarantee the loop stays within a
> single I-cache line, restoring the previous IPC and throughput.

Hmm, but what's special about __arch_clear_user()? If we make this change,
anybody could surely make similar arguments for other functions on their
hot paths?

Amusingly, there's CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B to play around
with this, and it sounds like the irqbypass change you cite falls into
the category of changes highlighted by the Kconfig text.

Will
Re: [PATCH] arm64: clear_user: align __arch_clear_user() to 64B for I-cache efficiency
Posted by Luke Yang 6 days, 21 hours ago
On Mon, 24 Nov 2025 13:38:25 +0000, Will Deacon wrote:
> Hmm, but what's special about __arch_clear_user()? If we make this
> change, anybody could surely make similar arguments for other
> functions on their hot paths?

Hi Will,

Thanks for the feedback.

I agree that the precedent question matters. In this case, though, the
irqbypass change introduced roughly a 30% regression in /dev/zero read
throughput. That is a fundamental primitive many workloads rely on, and
the regression stems from an unintended shift of __arch_clear_user() so
that its tight zeroing loop now crosses an I-cache line boundary. The
regression has also been deterministic in every kernel build we have
tested since it first appeared in 6.17.

The proposed ".p2align 6" does not add a new micro-optimisation. It
restores the instruction-cache locality the function had before the
irqbypass reshuffle. The cost is small (at most 63 bytes of padding in
one place), and the bar for applying this kind of fix can stay high: an
unrelated change must have caused a significant performance regression
in a widely used core primitive, which is exactly what happened here.
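
To make the arithmetic explicit: a loop stays inside one 64-byte line only
if its starting offset within the line plus its length is at most 64 bytes.
A toy sketch of that check (the 40-byte loop size below is an assumption
for illustration, not the exact size of the loop in clear_user.S):

// gcc -O2 -Wall -Wextra -o fits_one_line fits_one_line.c
#include <stdio.h>

static int fits_one_line(unsigned long addr, unsigned long len)
{
    /* True when [addr, addr + len) does not cross a 64-byte boundary. */
    return (addr % 64) + len <= 64;
}

int main(void)
{
    unsigned long len = 40;   /* assumed loop size in bytes */

    /* Aligned start: the whole loop fits in one I-cache line. */
    printf("offset  0: %s\n", fits_one_line(0x00, len) ? "one line" : "straddles");
    /* Start 32 bytes into a line: the same loop now spans two lines. */
    printf("offset 32: %s\n", fits_one_line(0x20, len) ? "one line" : "straddles");
    return 0;
}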

I am open to any solution that reliably restores the lost IPC, so please
let me know if you have something else in mind.

Thanks,
Luke