From nobody Thu Apr  2 17:02:37 2026
From: Demian Shulhan <demyansh@gmail.com>
To: linux-crypto@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: ebiggers@kernel.org,
	ardb@kernel.org,
	Demian Shulhan <demyansh@gmail.com>
Subject: [PATCH v2] lib/crc: arm64: add NEON accelerated CRC64-NVMe
 implementation
Date: Fri, 27 Mar 2026 06:02:11 +0000
Message-ID: <20260327060211.902077-1-demyansh@gmail.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260317065425.2684093-1-demyansh@gmail.com>
References: <20260317065425.2684093-1-demyansh@gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
Polynomial Multiply Long (PMULL) instructions. The generic table-driven
implementation processes one byte per step, which can become a
bottleneck in NVMe and other storage subsystems.

The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
than raw assembly for better readability and maintainability.

Key highlights of this implementation:
- Processes the buffer in chunks of at most 4 KiB, each inside its own
  scoped_ksimd() section, to avoid preemption latency spikes on large
  buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize
  register spilling.
- Benchmarks show the break-even point against the generic implementation
  is around 128 bytes. The PMULL path is enabled only for len >=3D 128.
- Safely falls back to the generic implementation on Big-Endian systems.

Performance results (KUnit crc_benchmark on Cortex-A72):
- Generic (len=3D4096): ~268 MB/s
- PMULL (len=3D4096): ~1556 MB/s (about 5.8x faster)

Signed-off-by: Demian Shulhan <demyansh@gmail.com>
---
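Note: the chunking and threshold logic described above can be modeled in
userspace to see how a given length is divided between the PMULL path and
the generic tail. The sketch below only mirrors the loop arithmetic in
crc64_nvme_arch(); it performs no CRC work, and the names crc_split,
crc64_plan_split, SIMD_MIN_LEN and SIMD_MAX_CHUNK are hypothetical and not
part of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the 128-byte threshold and SZ_4K. */
#define SIMD_MIN_LEN   128u
#define SIMD_MAX_CHUNK 4096u

struct crc_split {
	size_t simd_bytes;  /* bytes that would go to the PMULL path */
	size_t simd_calls;  /* number of SIMD sections entered */
	size_t tail_bytes;  /* bytes left for the generic path */
};

/* Model how a buffer of 'len' bytes is divided, assuming PMULL is usable. */
static struct crc_split crc64_plan_split(size_t len)
{
	struct crc_split s = { 0, 0, len };

	while (s.tail_bytes >= SIMD_MIN_LEN) {
		/* Round down to a multiple of 16, then cap the chunk. */
		size_t chunk = s.tail_bytes & ~(size_t)15;

		if (chunk > SIMD_MAX_CHUNK)
			chunk = SIMD_MAX_CHUNK;

		/* The inner routine only ever sees whole 16-byte blocks. */
		assert(chunk >= SIMD_MIN_LEN && chunk % 16 == 0);

		s.simd_bytes += chunk;
		s.simd_calls++;
		s.tail_bytes -= chunk;
	}
	return s;
}
```

For example, crc64_plan_split(4226) models two SIMD sections (4096 and 128
bytes) with a 2-byte generic tail, while a 4196-byte buffer gets a single
4096-byte section and a 100-byte tail, because the remainder falls below
the threshold.
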
v2:
  - Removed KERNEL_MODE_NEON check from Kconfig as it is redundant on
    arm64.
  - Added missing prototype for crc64_nvme_arm64_c to fix the sparse/W=3D1
    warning.
  - Improved readability in Makefile with extra newlines and comments.
  - Removed redundant include guards in crc64.h.
  - Switched to do-while loops for better optimization in hot paths.
  - Added comments explaining the magic constants (fold/Barrett).
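
For cross-checking the PMULL path outside the kernel, a bit-at-a-time
reference can be handy. The following is a minimal sketch, not the
kernel's generic implementation: the name crc64_nvme_ref is hypothetical,
the function works on the raw CRC state, and callers are assumed to apply
any initial and final inversion themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC64-NVMe generator polynomial. */
#define CRC64_NVME_POLY_REFLECTED 0x9a6c9329ac4bc9b5ULL

/* Hypothetical helper: LSB-first CRC64 update, one bit per step. */
static uint64_t crc64_nvme_ref(uint64_t crc, const uint8_t *p, size_t len)
{
	assert(p != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++) {
			/* Shift out one bit; fold in G if that bit was set. */
			uint64_t mask = (crc & 1) ? CRC64_NVME_POLY_REFLECTED : 0;

			crc = (crc >> 1) ^ mask;
		}
	}
	return crc;
}
```

Feeding the same buffer to this routine in one call or in several
consecutive calls gives the same state, which is the property the 4 KiB
chunking in crc64_nvme_arch() relies on.
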
---
 lib/crc/Kconfig                  |  1 +
 lib/crc/Makefile                 |  8 +++-
 lib/crc/arm64/crc64-neon-inner.c | 82 ++++++++++++++++++++++++++++++++
 lib/crc/arm64/crc64.h            | 29 +++++++++++
 4 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 lib/crc/arm64/crc64-neon-inner.c
 create mode 100644 lib/crc/arm64/crc64.h

diff --git a/lib/crc/Kconfig b/lib/crc/Kconfig
index 70e7a6016de3..16cb42d5e306 100644
--- a/lib/crc/Kconfig
+++ b/lib/crc/Kconfig
@@ -82,6 +82,7 @@ config CRC64
 config CRC64_ARCH
 	bool
 	depends on CRC64 && CRC_OPTIMIZATIONS
+	default y if ARM64
 	default y if RISCV && RISCV_ISA_ZBC && 64BIT
 	default y if X86_64
=20
diff --git a/lib/crc/Makefile b/lib/crc/Makefile
index 7543ad295ab6..c9c35419b39c 100644
--- a/lib/crc/Makefile
+++ b/lib/crc/Makefile
@@ -38,9 +38,15 @@ obj-$(CONFIG_CRC64) +=3D crc64.o
 crc64-y :=3D crc64-main.o
 ifeq ($(CONFIG_CRC64_ARCH),y)
 CFLAGS_crc64-main.o +=3D -I$(src)/$(SRCARCH)
+
+CFLAGS_REMOVE_arm64/crc64-neon-inner.o +=3D -mgeneral-regs-only
+CFLAGS_arm64/crc64-neon-inner.o +=3D -ffreestanding -march=3Darmv8-a+crypto
+CFLAGS_arm64/crc64-neon-inner.o +=3D -isystem $(shell $(CC) -print-file-na=
me=3Dinclude)
+crc64-$(CONFIG_ARM64) +=3D arm64/crc64-neon-inner.o
+
 crc64-$(CONFIG_RISCV) +=3D riscv/crc64_lsb.o riscv/crc64_msb.o
 crc64-$(CONFIG_X86) +=3D x86/crc64-pclmul.o
-endif
+endif # CONFIG_CRC64_ARCH
=20
 obj-y +=3D tests/
=20
diff --git a/lib/crc/arm64/crc64-neon-inner.c b/lib/crc/arm64/crc64-neon-in=
ner.c
new file mode 100644
index 000000000000..ad268ad35ab8
--- /dev/null
+++ b/lib/crc/arm64/crc64-neon-inner.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Accelerated CRC64 (NVMe) using ARM NEON C intrinsics
+ */
+
+#include <linux/types.h>
+#include <asm/neon-intrinsics.h>
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len);
+
+#define GET_P64_0(v) ((poly64_t)vgetq_lane_u64(vreinterpretq_u64_p64(v), 0=
))
+#define GET_P64_1(v) ((poly64_t)vgetq_lane_u64(vreinterpretq_u64_p64(v), 1=
))
+
+/* x^191 mod G, x^127 mod G */
+static const u64 fold_consts_val[2] =3D { 0xeadc41fd2ba3d420ULL,
+					0x21e9761e252621acULL };
+/* floor(x^127 / G), (G - x^64) / x */
+static const u64 bconsts_val[2] =3D { 0x27ecfa329aef9f77ULL,
+				    0x34d926535897936aULL };
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len)
+{
+	uint64x2_t v0_u64 =3D { crc, 0 };
+	poly64x2_t v0 =3D vreinterpretq_p64_u64(v0_u64);
+	poly64x2_t fold_consts =3D
+		vreinterpretq_p64_u64(vld1q_u64(fold_consts_val));
+	poly64x2_t v1 =3D vreinterpretq_p64_u8(vld1q_u8(p));
+
+	v0 =3D vreinterpretq_p64_u8(veorq_u8(vreinterpretq_u8_p64(v0),
+					   vreinterpretq_u8_p64(v1)));
+	p +=3D 16;
+	len -=3D 16;
+
+	do {
+		v1 =3D vreinterpretq_p64_u8(vld1q_u8(p));
+
+		poly128_t v2 =3D vmull_high_p64(fold_consts, v0);
+		poly128_t v0_128 =3D
+			vmull_p64(GET_P64_0(fold_consts), GET_P64_0(v0));
+
+		uint8x16_t x0 =3D veorq_u8(vreinterpretq_u8_p128(v0_128),
+					 vreinterpretq_u8_p128(v2));
+
+		x0 =3D veorq_u8(x0, vreinterpretq_u8_p64(v1));
+		v0 =3D vreinterpretq_p64_u8(x0);
+
+		p +=3D 16;
+		len -=3D 16;
+	} while (len >=3D 16);
+
+	/*
+	 * Reduce the 128-bit value to 64 bits by multiplying the high
+	 * 64 bits by x^127 mod G (fold_consts_val[1]) and XORing the
+	 * result with the low 64 bits.
+	 */
+	poly64x2_t v7 =3D vreinterpretq_p64_u64((uint64x2_t){ 0, 0 });
+	poly128_t v1_128 =3D vmull_p64(GET_P64_1(fold_consts), GET_P64_0(v0));
+
+	uint8x16_t ext_v0 =3D
+		vextq_u8(vreinterpretq_u8_p64(v0), vreinterpretq_u8_p64(v7), 8);
+	uint8x16_t x0 =3D veorq_u8(ext_v0, vreinterpretq_u8_p128(v1_128));
+
+	v0 =3D vreinterpretq_p64_u8(x0);
+
+	/* Final Barrett reduction */
+	poly64x2_t bconsts =3D vreinterpretq_p64_u64(vld1q_u64(bconsts_val));
+
+	v1_128 =3D vmull_p64(GET_P64_0(bconsts), GET_P64_0(v0));
+
+	poly64x2_t v1_64 =3D vreinterpretq_p64_u8(vreinterpretq_u8_p128(v1_128));
+	poly128_t v3_128 =3D vmull_p64(GET_P64_1(bconsts), GET_P64_0(v1_64));
+
+	x0 =3D veorq_u8(vreinterpretq_u8_p64(v0), vreinterpretq_u8_p128(v3_128));
+
+	uint8x16_t ext_v2 =3D vextq_u8(vreinterpretq_u8_p64(v7),
+				     vreinterpretq_u8_p128(v1_128), 8);
+
+	x0 =3D veorq_u8(x0, ext_v2);
+
+	v0 =3D vreinterpretq_p64_u8(x0);
+	return vgetq_lane_u64(vreinterpretq_u64_p64(v0), 1);
+}
diff --git a/lib/crc/arm64/crc64.h b/lib/crc/arm64/crc64.h
new file mode 100644
index 000000000000..2c1449d57486
--- /dev/null
+++ b/lib/crc/arm64/crc64.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * CRC64 using ARM64 PMULL instructions
+ */
+
+#include <linux/cpufeature.h>
+#include <asm/simd.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
+
+u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len);
+
+#define crc64_be_arch crc64_be_generic
+
+static inline u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
+{
+	if (len >=3D 128 && cpu_have_named_feature(PMULL) &&
+	    likely(may_use_simd())) {
+		do {
+			size_t chunk =3D min_t(size_t, len & ~15, SZ_4K);
+
+			scoped_ksimd() crc =3D crc64_nvme_arm64_c(crc, p, chunk);
+
+			p +=3D chunk;
+			len -=3D chunk;
+		} while (len >=3D 128);
+	}
+	return crc64_nvme_generic(crc, p, len);
+}
--=20
2.43.0