From nobody Tue Apr 7 19:54:15 2026
From: Muhammad Usama Anjum
To: Catalin Marinas, Will Deacon, "Matthew Wilcox (Oracle)", Muhammad Usama Anjum, Thomas Huth, Andrew Morton, Lance Yang, Yeoreum Yun, David Hildenbrand
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH] arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
Date: Wed, 11 Mar 2026 17:50:50 +0000
Message-ID: <20260311175054.3889093-1-usama.anjum@arm.com>
X-Mailer: git-send-email 2.47.3

In MTE synchronous mode, tag check faults are reported as immediate Data
Abort exceptions. The TFSR_EL1.TF1 bit is never set, since faults never
take the asynchronous path. Therefore, reading TFSR_EL1 and executing
data and instruction barriers on kernel entry, exit, context switch, and
suspend is unnecessary overhead in sync mode.

The exit path (mte_check_tfsr_exit) and the assembly paths
(check_mte_async_tcf / clear_mte_async_tcf in entry.S) already had this
check. Extend the same optimization to kernel entry/exit, context switch
and suspend.

All MTE kselftests pass, and the KUnit tests show the same results
before and after the patch.

Below is a selection of benchmarks (test_vmalloc, schbench, and syscall
micro-benchmarks) run on an arm64 machine, with v6.19 as the baseline
(>0 is faster, <0 is slower; (R)/(I) = statistically significant
Regression/Improvement). Based on significance and ignoring the noise,
the benchmarks improved.
* 77 result classes were considered, with 9 wins, 0 losses and 68 ties

Results of fastpath [1] on v6.19 vs this patch:

+----------------------------+----------------------------------------------------------+------------+
| Benchmark                  | Result Class                                             | barriers   |
+============================+==========================================================+============+
| micromm/fork               | fork: p:1, d:10 (seconds)                                | (I) 2.75%  |
|                            | fork: p:512, d:10 (seconds)                              | 0.96%      |
+----------------------------+----------------------------------------------------------+------------+
| micromm/munmap             | munmap: p:1, d:10 (seconds)                              | -1.78%     |
|                            | munmap: p:512, d:10 (seconds)                            | 5.02%      |
+----------------------------+----------------------------------------------------------+------------+
| micromm/vmalloc            | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          | -0.56%     |
|                            | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 0.70%      |
|                            | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 1.18%      |
|                            | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | -5.01%     |
|                            | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          | 13.81%     |
|                            | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          | 6.51%      |
|                            | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          | 32.87%     |
|                            | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 4.17%      |
|                            | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 8.40%      |
|                            | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | -0.48%     |
|                            | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | -0.74%     |
|                            | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           | 0.53%      |
|                            | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | -2.81%     |
|                            | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | -2.06%     |
|                            | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | -0.56%     |
|                            | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | -0.41%     |
|                            | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 0.89%      |
|                            | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 1.71%      |
|                            | vm_map_ram_test: p:1, h:0, l:500000 (usec)               | 0.83%      |
+----------------------------+----------------------------------------------------------+------------+
| schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec)              | 0.05%      |
|                            | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec)         | 0.60%      |
|                            | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 0.00%      |
|                            | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec)              | -0.34%     |
|                            | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec)         | -0.58%     |
|                            | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 9.09%      |
|                            | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec)             | -0.74%     |
|                            | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec)        | -1.40%     |
|                            | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec)     | 0.00%      |
|                            | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec)             | -0.78%     |
|                            | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec)        | -0.11%     |
|                            | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec)     | 0.11%      |
|                            | -m 16 -t 256 -r 10 -s 1000, avg_rps (req/sec)            | 2.64%      |
|                            | -m 16 -t 256 -r 10 -s 1000, req_latency_p99 (usec)       | 3.15%      |
|                            | -m 16 -t 256 -r 10 -s 1000, wakeup_latency_p99 (usec)    | 17.54%     |
|                            | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec)              | -1.22%     |
|                            | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec)         | 0.85%      |
|                            | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 0.00%      |
|                            | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec)              | -0.34%     |
|                            | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec)         | 1.05%      |
|                            | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 0.00%      |
|                            | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec)             | -0.41%     |
|                            | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec)        | 0.58%      |
|                            | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec)     | 2.13%      |
|                            | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec)             | 0.67%      |
|                            | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec)        | 2.07%      |
|                            | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec)     | -1.28%     |
|                            | -m 32 -t 256 -r 10 -s 1000, avg_rps (req/sec)            | 1.01%      |
|                            | -m 32 -t 256 -r 10 -s 1000, req_latency_p99 (usec)       | 0.69%      |
|                            | -m 32 -t 256 -r 10 -s 1000, wakeup_latency_p99 (usec)    | 13.12%     |
|                            | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec)              | -0.25%     |
|                            | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec)         | -0.48%     |
|                            | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 10.53%     |
|                            | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec)              | -0.06%     |
|                            | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec)         | 0.00%      |
|                            | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)      | 0.00%      |
|                            | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec)             | -0.36%     |
|                            | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec)        | 0.52%      |
|                            | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec)     | 0.11%      |
|                            | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec)             | 0.52%      |
|                            | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec)        | 3.53%      |
|                            | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec)     | -0.10%     |
|                            | -m 64 -t 256 -r 10 -s 1000, avg_rps (req/sec)            | 2.53%      |
|                            | -m 64 -t 256 -r 10 -s 1000, req_latency_p99 (usec)       | 1.82%      |
|                            | -m 64 -t 256 -r 10 -s 1000, wakeup_latency_p99 (usec)    | -5.80%     |
+----------------------------+----------------------------------------------------------+------------+
| syscall/getpid             | mean (ns)                                                | (I) 15.98% |
|                            | p99 (ns)                                                 | (I) 11.11% |
|                            | p99.9 (ns)                                               | (I) 16.13% |
+----------------------------+----------------------------------------------------------+------------+
| syscall/getppid            | mean (ns)                                                | (I) 14.82% |
|                            | p99 (ns)                                                 | (I) 17.86% |
|                            | p99.9 (ns)                                               | (I) 9.09%  |
+----------------------------+----------------------------------------------------------+------------+
| syscall/invalid            | mean (ns)                                                | (I) 17.78% |
|                            | p99 (ns)                                                 | (I) 11.11% |
|                            | p99.9 (ns)                                               | 13.33%     |
+----------------------------+----------------------------------------------------------+------------+

[1] https://gitlab.arm.com/tooling/fastpath

Signed-off-by: Muhammad Usama Anjum
Reviewed-by: David Hildenbrand (Arm)
Reviewed-by: Yeoreum Yun
---
The patch applies on v6.19 and next-20260309.
---
 arch/arm64/include/asm/mte.h | 6 +++++-
 arch/arm64/kernel/mte.c      | 5 +++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 6d4a78b9dc3e6..0e05d20cf2583 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -252,7 +252,8 @@ static inline void mte_check_tfsr_entry(void)
 	if (!kasan_hw_tags_enabled())
 		return;
 
-	mte_check_tfsr_el1();
+	if (system_uses_mte_async_or_asymm_mode())
+		mte_check_tfsr_el1();
 }
 
 static inline void mte_check_tfsr_exit(void)
@@ -260,6 +261,9 @@ static inline void mte_check_tfsr_exit(void)
 	if (!kasan_hw_tags_enabled())
 		return;
 
+	if (!system_uses_mte_async_or_asymm_mode())
+		return;
+
 	/*
 	 * The asynchronous faults are sync'ed automatically with
 	 * TFSR_EL1 on kernel entry but for exit an explicit dsb()
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 32148bf09c1dc..8da2891b834d7 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -291,6 +291,8 @@ void mte_thread_switch(struct task_struct *next)
 	/* TCO may not have been disabled on exception entry for the current task. */
 	mte_disable_tco_entry(next);
 
+	if (!system_uses_mte_async_or_asymm_mode())
+		return;
 	/*
 	 * Check if an async tag exception occurred at EL1.
 	 *
@@ -350,6 +352,9 @@ void mte_suspend_enter(void)
 	if (!system_supports_mte())
 		return;
 
+	if (!system_uses_mte_async_or_asymm_mode())
+		return;
+
 	/*
 	 * The barriers are required to guarantee that the indirect writes
 	 * to TFSR_EL1 are synchronized before we report the state.
-- 
2.47.3