From: Qi Xi
Subject: [PATCH v3] Faster Arm64 __arch_copy_from_user and __arch_copy_to_user
Date: Mon, 16 Mar 2026 20:31:00 +0800
Message-ID: <20260316123100.82932-1-xiqi2@huawei.com>

Based on Ben Niu's "Faster Arm64 __arch_copy_from_user and
__arch_copy_to_user" patch [1], this implementation further optimizes and
simplifies user-space copies by:

1. Limiting the optimization scope to copies of >=128 bytes, where the PAN
   state matters. Copies of fewer than 128 bytes use non-privileged
   instructions uniformly, simplifying the code and reducing maintenance
   cost.

2. Adding "arm64.nopan" cmdline support using the standard idreg-override
   framework, allowing PAN to be disabled at runtime without building
   separate CONFIG_ARM64_PAN=y/n kernels as required by Ben Niu's version.
   The implementation maintains separate paths for PAN-enabled (using
   unprivileged ldtr/sttr) and PAN-disabled (using standard ldp/stp)
   operation, with runtime selection via ALTERNATIVE() at the entry of the
   large-copy loop.

3. Retaining the critical-path optimization from the original patch:
   reducing pointer-update instructions through manual batched updates,
   processing 64 bytes per iteration with only one pair of add
   instructions.

Performance improvements were measured on Kunpeng 920 with PAN disabled,
using the ku_copy microbenchmark [2] (a kernel module that measures
copy_to_user/copy_from_user throughput across various sizes by copying
1 GB of data in each test).

copy_to_user throughput change (positive = improvement):
     128B:  +0.9%
     256B: +10.3%
     512B: +23.3%
    1024B: +38.1%
    2048B: +56.2%
    4096B: +68.5%
    8192B: +74.8%
   16384B: +79.7%
   32768B: +80.7%
   65536B: +81.3%
  131072B: +77.3%
  262144B: +77.9%

copy_from_user throughput change:
     128B:  +2.0%
     256B:  +7.5%
     512B: +20.3%
    1024B: +28.4%
    2048B: +38.1%
    4096B: +39.6%
    8192B: +41.5%
   16384B: +42.3%
   32768B: +42.2%
   65536B: +44.8%
  131072B: +70.3%
  262144B: +71.0%

Real-world workloads:

- RocksDB read-write mixed workload: overall throughput improved by 2%.
  The copy_to_user hotspot dropped from 3.3% to 2.7% of total CPU cycles,
  and the copy_from_user hotspot from 2.25% to 0.85%.

- BRPC rdma_performance (server side, baidu_std protocol over TCP):
  copy_to_user accounts for ~11.5% of total CPU cycles. After
  optimization, server CPU utilization dropped from 64% to 62% (a 2%
  absolute improvement, equivalent to a ~17% reduction in copy_to_user
  overhead).

[1] https://lore.kernel.org/all/20251018052237.1368504-2-benniu@meta.com/
[2] https://github.com/mcfi/benchmark/tree/main/ku_copy

Co-developed-by: Ben Niu
Signed-off-by: Ben Niu
Signed-off-by: Jinjiang Tu
Signed-off-by: Qi Xi
---
Changes in v3:
- Limit the optimization scope to copies of >=128 bytes.
- Use idreg-override for PAN runtime selection with the "arm64.nopan"
  cmdline option.
---
 arch/arm64/include/asm/asm-uaccess.h  |  22 ++--
 arch/arm64/kernel/pi/idreg-override.c |   2 +
 arch/arm64/lib/copy_from_user.S       |  17 +++-
 arch/arm64/lib/copy_template.S        | 108 +++++++++++++++++++-------
 arch/arm64/lib/copy_to_user.S         |  17 +++-
 5 files changed, 114 insertions(+), 52 deletions(-)

diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
index 9148f5a31968..198a05d478fc 100644
--- a/arch/arm64/include/asm/asm-uaccess.h
+++ b/arch/arm64/include/asm/asm-uaccess.h
@@ -70,27 +70,21 @@ alternative_else_nop_endif
  * This is complicated as there is no post-increment or pair versions of the
  * unprivileged instructions, and USER() only works for single instructions.
  */
-	.macro user_ldp l, reg1, reg2, addr, post_inc
-8888:	ldtr \reg1, [\addr];
-8889:	ldtr \reg2, [\addr, #8];
-	add \addr, \addr, \post_inc;
+	.macro user_ldst l, inst, reg, addr, post_inc
+8888:	\inst \reg, [\addr];
+	add \addr, \addr, \post_inc;
 
 	_asm_extable_uaccess 8888b, \l;
-	_asm_extable_uaccess 8889b, \l;
 	.endm
 
-	.macro user_stp l, reg1, reg2, addr, post_inc
-8888:	sttr \reg1, [\addr];
-8889:	sttr \reg2, [\addr, #8];
-	add \addr, \addr, \post_inc;
+	.macro user_ldst_index l, inst, reg, addr, val
+8888:	\inst \reg, [\addr, \val];
 
-	_asm_extable_uaccess 8888b,\l;
-	_asm_extable_uaccess 8889b,\l;
+	_asm_extable_uaccess 8888b, \l;
 	.endm
 
-	.macro user_ldst l, inst, reg, addr, post_inc
-8888:	\inst \reg, [\addr];
-	add \addr, \addr, \post_inc;
+	.macro user_ldst_pair_index l, inst, reg1, reg2, addr, val
+8888:	\inst \reg1, \reg2, [\addr, \val];
 
 	_asm_extable_uaccess 8888b, \l;
 	.endm
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index bc57b290e5e7..ac26f1f3aad4 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -64,6 +64,7 @@ static const struct ftr_set_desc mmfr1 __prel64_initconst = {
 	.override	= &id_aa64mmfr1_override,
 	.fields		= {
 		FIELD("vh", ID_AA64MMFR1_EL1_VH_SHIFT, mmfr1_vh_filter),
+		FIELD("pan", ID_AA64MMFR1_EL1_PAN_SHIFT, NULL),
 		{}
 	},
 };
@@ -249,6 +250,7 @@ static const struct {
 	{ "arm64.nolva",		"id_aa64mmfr2.varange=0" },
 	{ "arm64.no32bit_el0",		"id_aa64pfr0.el0=1" },
 	{ "arm64.nompam",		"id_aa64pfr0.mpam=0 id_aa64pfr1.mpam_frac=0" },
+	{ "arm64.nopan",		"id_aa64mmfr1.pan=0" },
 };
 
 static int __init parse_hexdigit(const char *p, u64 *v)
diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
index 400057d607ec..1f578c4d0ae6 100644
--- a/arch/arm64/lib/copy_from_user.S
+++ b/arch/arm64/lib/copy_from_user.S
@@ -44,12 +44,21 @@
 	str \reg, [\ptr], \val
 	.endm
 
-	.macro ldp1 reg1, reg2, ptr, val
-	user_ldp 9997f, \reg1, \reg2, \ptr, \val
+	.macro ldp_unpriv reg1, reg2, ptr, val
+	user_ldst_index 9997f, ldtr, \reg1, \ptr, \val
+	user_ldst_index 9997f, ldtr, \reg2, \ptr, \val + 8
 	.endm
 
-	.macro stp1 reg1, reg2, ptr, val
-	stp \reg1, \reg2, [\ptr], \val
+	.macro stp_unpriv reg1, reg2, ptr, val
+	stp \reg1, \reg2, [\ptr, \val]
+	.endm
+
+	.macro ldp_priv reg1, reg2, ptr, val
+	user_ldst_pair_index 9997f, ldp, \reg1, \reg2, \ptr, \val
+	.endm
+
+	.macro stp_priv reg1, reg2, ptr, val
+	stp \reg1, \reg2, [\ptr, \val]
 	.endm
 
 	.macro cpy1 dst, src, count
diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
index 7f2f5a0e2fb9..5ef6dc9bf7d8 100644
--- a/arch/arm64/lib/copy_template.S
+++ b/arch/arm64/lib/copy_template.S
@@ -97,14 +97,20 @@ alternative_else_nop_endif
 	cmp tmp1w, #0x20
 	b.eq 1f
 	b.lt 2f
-	ldp1 A_l, A_h, src, #16
-	stp1 A_l, A_h, dst, #16
+	ldp_unpriv A_l, A_h, src, #0
+	stp_unpriv A_l, A_h, dst, #0
+	add src, src, #16
+	add dst, dst, #16
 1:
-	ldp1 A_l, A_h, src, #16
-	stp1 A_l, A_h, dst, #16
+	ldp_unpriv A_l, A_h, src, #0
+	stp_unpriv A_l, A_h, dst, #0
+	add src, src, #16
+	add dst, dst, #16
 2:
-	ldp1 A_l, A_h, src, #16
-	stp1 A_l, A_h, dst, #16
+	ldp_unpriv A_l, A_h, src, #0
+	stp_unpriv A_l, A_h, dst, #0
+	add src, src, #16
+	add dst, dst, #16
 .Ltiny15:
 	/*
 	 * Prefer to break one ldp/stp into several load/store to access
@@ -142,14 +148,16 @@ alternative_else_nop_endif
 	 * Less than 128 bytes to copy, so handle 64 here and then jump
 	 * to the tail.
 	 */
-	ldp1 A_l, A_h, src, #16
-	stp1 A_l, A_h, dst, #16
-	ldp1 B_l, B_h, src, #16
-	ldp1 C_l, C_h, src, #16
-	stp1 B_l, B_h, dst, #16
-	stp1 C_l, C_h, dst, #16
-	ldp1 D_l, D_h, src, #16
-	stp1 D_l, D_h, dst, #16
+	ldp_unpriv A_l, A_h, src, #0
+	stp_unpriv A_l, A_h, dst, #0
+	ldp_unpriv B_l, B_h, src, #16
+	ldp_unpriv C_l, C_h, src, #32
+	stp_unpriv B_l, B_h, dst, #16
+	stp_unpriv C_l, C_h, dst, #32
+	ldp_unpriv D_l, D_h, src, #48
+	stp_unpriv D_l, D_h, dst, #48
+	add src, src, #64
+	add dst, dst, #64
 
 	tst count, #0x3f
 	b.ne .Ltail63
@@ -161,30 +169,70 @@ alternative_else_nop_endif
 	 */
 	.p2align L1_CACHE_SHIFT
 .Lcpy_body_large:
+	/* Runtime PAN decision for large copies */
+	ALTERNATIVE("b .Llarge_pan_disabled", "b .Llarge_pan_enabled", ARM64_HAS_PAN)
+
+.Llarge_pan_enabled:
+	/* PAN enabled version - use unprivileged loads (ldp_unpriv) */
 	/* pre-get 64 bytes data. */
-	ldp1 A_l, A_h, src, #16
-	ldp1 B_l, B_h, src, #16
-	ldp1 C_l, C_h, src, #16
-	ldp1 D_l, D_h, src, #16
+	ldp_unpriv A_l, A_h, src, #0
+	ldp_unpriv B_l, B_h, src, #16
+	ldp_unpriv C_l, C_h, src, #32
+	ldp_unpriv D_l, D_h, src, #48
+	add src, src, #64
+1:
+	/*
+	 * interlace the load of next 64 bytes data block with store of the last
+	 * loaded 64 bytes data.
+	 */
+	stp_unpriv A_l, A_h, dst, #0
+	ldp_unpriv A_l, A_h, src, #0
+	stp_unpriv B_l, B_h, dst, #16
+	ldp_unpriv B_l, B_h, src, #16
+	stp_unpriv C_l, C_h, dst, #32
+	ldp_unpriv C_l, C_h, src, #32
+	stp_unpriv D_l, D_h, dst, #48
+	ldp_unpriv D_l, D_h, src, #48
+	add dst, dst, #64
+	add src, src, #64
+	subs count, count, #64
+	b.ge 1b
+	b .Llarge_done
+
+.Llarge_pan_disabled:
+	/* PAN disabled version - use normal loads without post-increment */
+	/* pre-get 64 bytes data using normal loads */
+	ldp_priv A_l, A_h, src, #0
+	ldp_priv B_l, B_h, src, #16
+	ldp_priv C_l, C_h, src, #32
+	ldp_priv D_l, D_h, src, #48
+	add src, src, #64
 1:
 	/*
 	 * interlace the load of next 64 bytes data block with store of the last
 	 * loaded 64 bytes data.
 	 */
-	stp1 A_l, A_h, dst, #16
-	ldp1 A_l, A_h, src, #16
-	stp1 B_l, B_h, dst, #16
-	ldp1 B_l, B_h, src, #16
-	stp1 C_l, C_h, dst, #16
-	ldp1 C_l, C_h, src, #16
-	stp1 D_l, D_h, dst, #16
-	ldp1 D_l, D_h, src, #16
+	stp_priv A_l, A_h, dst, #0
+	ldp_priv A_l, A_h, src, #0
+	stp_priv B_l, B_h, dst, #16
+	ldp_priv B_l, B_h, src, #16
+	stp_priv C_l, C_h, dst, #32
+	ldp_priv C_l, C_h, src, #32
+	stp_priv D_l, D_h, dst, #48
+	ldp_priv D_l, D_h, src, #48
+	add dst, dst, #64
+	add src, src, #64
 	subs count, count, #64
 	b.ge 1b
-	stp1 A_l, A_h, dst, #16
-	stp1 B_l, B_h, dst, #16
-	stp1 C_l, C_h, dst, #16
-	stp1 D_l, D_h, dst, #16
+
+.Llarge_done:
+	/* Post-loop: store the last block of data using stp_unpriv */
+	/* (without post-increment) */
+	stp_unpriv A_l, A_h, dst, #0
+	stp_unpriv B_l, B_h, dst, #16
+	stp_unpriv C_l, C_h, dst, #32
+	stp_unpriv D_l, D_h, dst, #48
+	add dst, dst, #64
 
 	tst count, #0x3f
 	b.ne .Ltail63
diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
index 819f2e3fc7a9..9738ae96c823 100644
--- a/arch/arm64/lib/copy_to_user.S
+++ b/arch/arm64/lib/copy_to_user.S
@@ -43,12 +43,21 @@
 	user_ldst 9997f, sttr, \reg, \ptr, \val
 	.endm
 
-	.macro ldp1 reg1, reg2, ptr, val
-	ldp \reg1, \reg2, [\ptr], \val
+	.macro ldp_unpriv reg1, reg2, ptr, val
+	ldp \reg1, \reg2, [\ptr, \val]
 	.endm
 
-	.macro stp1 reg1, reg2, ptr, val
-	user_stp 9997f, \reg1, \reg2, \ptr, \val
+	.macro stp_unpriv reg1, reg2, ptr, val
+	user_ldst_index 9997f, sttr, \reg1, \ptr, \val
+	user_ldst_index 9997f, sttr, \reg2, \ptr, \val + 8
+	.endm
+
+	.macro ldp_priv reg1, reg2, ptr, val
+	ldp \reg1, \reg2, [\ptr, \val]
+	.endm
+
+	.macro stp_priv reg1, reg2, ptr, val
+	user_ldst_pair_index 9997f, stp, \reg1, \reg2, \ptr, \val
 	.endm
 
 	.macro cpy1 dst, src, count
-- 
2.33.0