On aarch64 kernels, a recent change (specifically the irqbypass patch
https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
shifted __arch_clear_user() such that its tight zeroing loop straddles
I-cache lines. This causes a measurable read performance regression when
reading from /dev/zero.
Add `.p2align 6` (64-byte alignment) so that the loop no longer crosses
an I-cache line boundary, restoring the previous IPC and throughput.
Tested on bare-metal aarch64 systems:

  Good kernel: pread_z100k ~ 6.9 s
  Bad kernel:  pread_z100k ~ 9.0 s
  With patch:  pread_z100k ~ 6.9 s
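
As a sanity check, the placement of the symbol can be inspected from
userspace. This is a quick sketch, separate from the reproducer below;
it needs root (or kernel.kptr_restrict=0) so /proc/kallsyms reports real
addresses, and the file name is chosen here for illustration:

// check_align.c
// gcc -O2 -Wall -o check_align check_align.c
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/kallsyms", "r");
	if (!f) { perror("fopen /proc/kallsyms"); return 1; }

	char line[256];
	while (fgets(line, sizeof(line), f)) {
		unsigned long long addr;
		char type, name[128];

		/* kallsyms lines look like: "<addr> <type> <symbol>" */
		if (sscanf(line, "%llx %c %127s", &addr, &type, name) == 3 &&
		    strcmp(name, "__arch_clear_user") == 0) {
			printf("__arch_clear_user at %#llx, offset in 64B line: %llu\n",
			       addr, addr % 64);
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	fprintf(stderr, "__arch_clear_user not found\n");
	return 1;
}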
Reproducer:
// gcc -O2 -Wall -Wextra -o pread_z100k pread_z100k.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>

#define SIZE  (100 * 1024)
#define COUNT 1000000

/* Wall-clock time in seconds. */
static double now_sec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	int fd = open("/dev/zero", O_RDONLY);
	if (fd < 0) { perror("open /dev/zero"); return 1; }

	char *buf = malloc(SIZE);
	if (!buf) { perror("malloc"); close(fd); return 1; }

	/* Time COUNT zero-filling reads of SIZE bytes each. */
	double t1 = now_sec();
	for (int i = 0; i < COUNT; i++) {
		ssize_t r = pread(fd, buf, SIZE, 0);
		if (r != SIZE) { perror("pread"); break; }
	}
	double t2 = now_sec();

	printf("%.6f\n", t2 - t1);
	close(fd);
	free(buf);
	return 0;
}
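
For context, each pread() of /dev/zero is expected to reach
__arch_clear_user() via iov_iter_zero()/clear_user(), so the loop above
spends essentially all of its time in the affected function; at 1000000
reads of 100 KiB each, one run zeroes roughly 100 GB, which is why a
per-cache-line effect shows up as whole seconds of wall-clock time.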
Signed-off-by: Luke Yang <luyang@redhat.com>
Signed-off-by: Jirka Hladky <jhladky@redhat.com>
---
arch/arm64/lib/clear_user.S | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm64/lib/clear_user.S b/arch/arm64/lib/clear_user.S
index de9a303b6..91eee4a7c 100644
--- a/arch/arm64/lib/clear_user.S
+++ b/arch/arm64/lib/clear_user.S
@@ -17,6 +17,11 @@
  * Alignment fixed up by hardware.
  */
+/*
+ * Ensure __arch_clear_user() always starts on a clean I-cache boundary.
+ */
+	.p2align 6	// 2^6 = 64-byte alignment
+
 SYM_FUNC_START(__arch_clear_user)
 	add	x2, x0, x1
--
2.51.1
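
For reference on the 64-byte figure: it matches the minimum I-cache line
size that most modern ARM64 cores report in CTR_EL0, which Linux makes
readable from EL0. A minimal aarch64-only sketch (the file name is
chosen here for illustration) to confirm the line size on a given
machine:

// icache_line.c
// gcc -O2 -Wall -o icache_line icache_line.c   (on an aarch64 host)
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t ctr;

	/* CTR_EL0.IminLine (bits [3:0]) is log2 of the smallest
	 * I-cache line, in units of 4-byte words. */
	__asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
	printf("smallest I-cache line: %u bytes\n", 4u << (ctr & 0xf));
	return 0;
}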
On Fri, Nov 21, 2025 at 12:04:55AM -0500, Luke Yang wrote:
> On aarch64 kernels, a recent change (specifically the irqbypass patch
> https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
> shifted __arch_clear_user() such that its tight zeroing loop straddles
> I-cache lines. This causes a measurable read performance regression when
> reading from /dev/zero.
>
> Add `.p2align 6` (64-byte alignment) so that the loop no longer crosses
> an I-cache line boundary, restoring the previous IPC and throughput.

Hmm, but what's special about __arch_clear_user()? If we make this
change, anybody could surely make similar arguments for other functions
on their hot paths?

Amusingly, there's CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B to play around
with that, and it sounds like the irqbypass change you cite falls into
the category of changes highlighted by the Kconfig text.

Will
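
As background on the option Will mentions: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B
builds the kernel with every function forced to a 64-byte boundary
(via -falign-functions=64), so that layout-sensitive performance swings
of this kind can be separated from genuine regressions. The same effect
can be seen in miniature in userspace; a small sketch with hypothetical
function names, built twice to compare the reported offsets:

// align_demo.c
// gcc -O2 -o align_demo align_demo.c                      (default)
// gcc -O2 -falign-functions=64 -o align_demo align_demo.c (forced 64B)
#include <stdio.h>
#include <stdint.h>

/* Stand-ins for hot-path functions; only their start addresses matter. */
static void fn_a(void) { puts("a"); }
static void fn_b(void) { puts("b"); }

static void report(const char *name, void (*fn)(void))
{
	uintptr_t a = (uintptr_t)fn;

	printf("%s at %#lx, offset within 64B line: %lu\n",
	       name, (unsigned long)a, (unsigned long)(a % 64));
}

int main(void)
{
	report("fn_a", fn_a);
	report("fn_b", fn_b);
	return 0;
}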
On Mon, 24 Nov 2025 13:38:25 +0000, Will Deacon wrote:
> Hmm, but what's special about __arch_clear_user()? If we make this
> change, anybody could surely make similar arguments for other
> functions on their hot paths?

Hi Will,

Thanks for the feedback. I agree that the precedent question matters.
In this case, though, the irqbypass change introduced roughly a 30%
regression in /dev/zero read throughput. That is a fundamental primitive
that many workloads rely on, and the regression stems from an unintended
shift of __arch_clear_user() so that its tight zeroing loop now crosses
an I-cache boundary. The regression has also persisted deterministically
in every subsequent kernel build we have tested since it first appeared
in 6.17.

The proposed ".p2align 6" is not a new micro-optimisation. It restores
the instruction-cache locality the function had before the irqbypass
reshuffle. The cost is very small (up to 64 bytes of padding in one
place), and the bar for applying this kind of fix is correspondingly
high: an unrelated change has to cause a significant performance
regression in a widely used core primitive, which is exactly what
happened here.

I am open to any solution that reliably restores the lost IPC, so please
let me know if you have something else in mind.

Thanks,
Luke