From: Matteo Croce
To: linux-riscv@lists.infradead.org, Paul Walmsley, Palmer Dabbelt, Albert Ou
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH] riscv: memcpy: fast copy for unaligned buffers
Date: Thu, 29 Jan 2026 02:02:11 +0100
Message-ID: <20260129010211.103615-1-teknoraver@meta.com>

The RISC-V memcpy() does an 8-byte-wide copy when the two buffers have
the same alignment, and falls back to a single-byte copy otherwise.
Implement an unalignment-aware copy for the mismatched case, which
still moves 8 bytes at a time by shifting and combining adjacent
source words.

Benchmarks show that the aligned code path is unaffected, while the
unaligned one gets a ~2.3x boost.
Benchmark with the current implementation:

memcpy: aligned copy of 400 MBytes in 429 msecs (931 MB/s)
memcpy: unaligned copy of 400 MBytes in 1202 msecs (332 MB/s)

Benchmark with the new unaligned copy:

memcpy: aligned copy of 400 MBytes in 428 msecs (933 MB/s)
memcpy: unaligned copy of 400 MBytes in 519 msecs (770 MB/s)

These numbers were measured on a 1.8 GHz SiFive P550 CPU with this
custom unit test:
https://lore.kernel.org/lkml/20260129004328.102770-1-teknoraver@meta.com/T/

Signed-off-by: Matteo Croce
---
 arch/riscv/lib/memcpy.S | 84 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 79 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
index 44e009ec5fef..293f8a348cfd 100644
--- a/arch/riscv/lib/memcpy.S
+++ b/arch/riscv/lib/memcpy.S
@@ -10,13 +10,14 @@ SYM_FUNC_START(__memcpy)
 	move t6, a0  /* Preserve return value */
 
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
+	/* Check alignment first */
 	andi a3, t6, SZREG-1
 	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
+	bne a3, a4, .Lshifted_copy
+
+	/* Aligned path: defer to byte-oriented copy for small sizes */
+	sltiu a5, a2, 128
+	bnez a5, 4f
 
 	beqz a3, 2f  /* Skip if already aligned */
 	/*
@@ -76,6 +77,79 @@ SYM_FUNC_START(__memcpy)
 	addi t6, t6, 16*SZREG
 	bltu a1, a3, 3b
 	andi a2, a2, (16*SZREG)-1  /* Update count */
+	j 4f  /* Skip shifted copy section */
+
+.Lshifted_copy:
+	/*
+	 * Source and dest have different alignments.
+	 * a3 = dest & (SZREG-1), a4 = src & (SZREG-1)
+	 * Align destination first, then use shifted word copy.
+	 */
+
+	/* For small sizes, just use byte copy */
+	sltiu a5, a2, 16
+	bnez a5, 4f
+
+	/* If dest is already aligned, skip to shifted loop setup */
+	beqz a3, .Ldest_aligned
+
+	/* Calculate bytes needed to align dest: SZREG - a3 */
+	neg a5, a3
+	addi a5, a5, SZREG
+	sub a2, a2, a5  /* Update count */
+
+.Lalign_dest_loop:
+	lb a4, 0(a1)
+	addi a1, a1, 1
+	sb a4, 0(t6)
+	addi t6, t6, 1
+	addi a5, a5, -1
+	bnez a5, .Lalign_dest_loop
+
+.Ldest_aligned:
+	/*
+	 * Dest is now aligned. Check if we have enough bytes
+	 * remaining for word-oriented copy.
+	 */
+	sltiu a3, a2, SZREG
+	bnez a3, 4f
+
+	/*
+	 * Calculate shift amounts based on source alignment (distance).
+	 * distance = src & (SZREG-1), guaranteed non-zero since we only
+	 * reach here when src and dest had different alignments.
+	 */
+	andi a3, a1, SZREG-1  /* a3 = distance */
+	slli a4, a3, 3  /* a4 = distance * 8 (right shift amount) */
+	li a5, SZREG*8
+	sub a5, a5, a4  /* a5 = SZREG*8 - distance*8 (left shift) */
+
+	/* Align src backwards to word boundary */
+	sub a1, a1, a3
+
+	/* Calculate end address: dest + (count rounded down to words) */
+	andi a6, a2, ~(SZREG-1)
+	add a6, t6, a6  /* a6 = loop end address for dest */
+
+	/* Load first aligned word from source */
+	REG_L t0, 0(a1)
+
+.Lshifted_loop:
+	REG_L t1, SZREG(a1)  /* Load next aligned word */
+	srl t2, t0, a4  /* Shift right: low part from current word */
+	mv t0, t1  /* Current = next for next iteration */
+	addi a1, a1, SZREG
+	addi t6, t6, SZREG
+	sll t3, t0, a5  /* Shift left: high part from next word */
+	or t2, t2, t3  /* Combine to form output word */
+	REG_S t2, -SZREG(t6)  /* Store to aligned dest */
+	bltu t6, a6, .Lshifted_loop
+
+	/* Restore src to correct unaligned position */
+	add a1, a1, a3
+	/* Calculate remaining byte count */
+	andi a2, a2, SZREG-1
+	/* Fall through to label 4 for remaining bytes */
 
 4:	/* Handle trailing misalignment */
-- 
2.52.0
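For readers less familiar with the shift-and-combine trick, here is a
rough C model of the shifted copy path. This is an illustrative sketch,
not the kernel code: the name shifted_memcpy is mine, it assumes a
little-endian machine with 8-byte words, and, like the assembly, the
aligned source loads may touch up to 7 bytes outside [src, src+len),
so a userspace test needs slack around the source buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * C sketch of the patch's shifted word copy (little-endian, SZREG = 8).
 * Aligned word loads are done via memcpy() to stay within C aliasing
 * rules; the kernel assembly simply issues aligned REG_L loads.
 */
static void *shifted_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Byte-copy until dst is word aligned (or the count runs out). */
	while (len && ((uintptr_t)d & 7)) {
		*d++ = *s++;
		len--;
	}

	size_t dist = (uintptr_t)s & 7;	/* source misalignment */
	if (dist && len >= 8) {
		unsigned rsh = (unsigned)dist * 8;	/* bits from current word */
		unsigned lsh = 64 - rsh;		/* bits from next word */
		const unsigned char *p = s - dist;	/* src aligned down */
		size_t words = len / 8;
		uint64_t cur, next;

		/*
		 * NOTE: like the assembly, this reads the aligned words
		 * containing the source range, which can extend up to 7
		 * bytes before src and past src+len.
		 */
		memcpy(&cur, p, 8);
		for (size_t i = 0; i < words; i++) {
			memcpy(&next, p + 8, 8);
			uint64_t out = (cur >> rsh) | (next << lsh);
			memcpy(d, &out, 8);	/* aligned store */
			cur = next;
			p += 8;
			d += 8;
			s += 8;
		}
		len &= 7;
	}

	/* Trailing bytes (also the fallback when dist happens to be 0). */
	while (len--)
		*d++ = *s++;
	return dst;
}
```

The guard `dist != 0` matters in C: with `dist == 0` the left shift
would be by 64 bits, which is undefined behavior. The assembly does not
need the check because the shifted path is only entered when source and
destination alignments differ, so the distance stays non-zero even
after the destination is aligned.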