From: Matteo Croce
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] riscv: memcpy: fast copy for unaligned buffers
Date: Wed, 11 Mar 2026 18:31:38 +0100
Message-ID: <20260311173138.1820-1-teknoraver@meta.com>

The RISC-V memcpy() does an 8-byte wide copy when the two buffers have
the same alignment, and falls back to a single-byte copy otherwise.
Implement an alignment-aware path for mismatched buffers which still
copies 8 bytes at a time, combining adjacent source words with the
proper shifts.

Synthetic benchmarks[1] show that the aligned code path is unaffected,
while the unaligned one gets a ~2.3x boost:

Before:
memcpy: aligned copy of 400 MBytes in 429 msecs (931 MB/s)
memcpy: unaligned copy of 400 MBytes in 1202 msecs (332 MB/s)

After:
memcpy: aligned copy of 400 MBytes in 428 msecs (933 MB/s)
memcpy: unaligned copy of 400 MBytes in 519 msecs (770 MB/s)

Network RX benchmarks on a Milk-V Megrez (ESWIN EIC7700X) with a 1 Gbps
NIC (stmmac driver) confirm the improvement in a real-world scenario.
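For readers not fluent in RISC-V assembly, the shifted-word technique can
be modeled in portable C roughly as below. This is my own little-endian
sketch, not part of the patch; memcpy_shifted and all names are
illustrative. Like the assembly, it may read up to SZREG-1 bytes past the
source buffer within the last aligned word, so callers must tolerate that
overread:

```c
#include <stdint.h>
#include <stddef.h>

static void *memcpy_shifted(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	const size_t wsz = sizeof(uintptr_t);	/* SZREG */

	if (n >= 2 * wsz) {
		/* Byte-copy until dst is word aligned */
		while ((uintptr_t)d % wsz) {
			*d++ = *s++;
			n--;
		}

		size_t dist = (uintptr_t)s % wsz;	/* src misalignment */
		if (dist && n >= wsz) {
			unsigned rshift = dist * 8;		/* a4 in the patch */
			unsigned lshift = wsz * 8 - rshift;	/* a5 in the patch */
			/* Round src down to the aligned word containing it */
			const uintptr_t *sw = (const uintptr_t *)(s - dist);
			uintptr_t cur = *sw++;

			for (; n >= wsz; n -= wsz) {
				uintptr_t next = *sw++;
				/* Low bytes from cur, high bytes from next */
				*(uintptr_t *)d = (cur >> rshift) |
						  (next << lshift);
				cur = next;
				d += wsz;
				s += wsz;
			}
		}
		/* dist == 0 means both now aligned; the patch's existing
		 * aligned word loop handles that case, elided here. */
	}

	/* Small sizes and trailing bytes */
	while (n--)
		*d++ = *s++;
	return dst;
}
```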
UDP RX flood with varying frame sizes:

Frame size   stock memcpy   optimized memcpy   Improvement
----------   ------------   ----------------   -----------
 64 bytes     242.6 Kpps       246.9 Kpps          +1.8%
128 bytes     225.3 Kpps       243.0 Kpps          +7.9%
256 bytes     200.8 Kpps       227.8 Kpps         +13.4%
512 bytes     165.4 Kpps       203.6 Kpps         +23.1%

Throughput at 512-byte frames improved from 672 Mbps to 827 Mbps.
The improvement scales with frame size, as larger frames copy more
bytes per packet. Larger frame sizes were not tested because they
would saturate the 1 Gbps link.

[1] https://lore.kernel.org/lkml/20260301011209.4160-1-teknoraver@meta.com/

Signed-off-by: Matteo Croce
---
v2: add network benchmarks and link to synthetic benchmark

 arch/riscv/lib/memcpy.S | 84 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 79 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
index 44e009ec5fef..293f8a348cfd 100644
--- a/arch/riscv/lib/memcpy.S
+++ b/arch/riscv/lib/memcpy.S
@@ -10,13 +10,14 @@ SYM_FUNC_START(__memcpy)
 	move t6, a0  /* Preserve return value */
 
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
+	/* Check alignment first */
 	andi a3, t6, SZREG-1
 	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
+	bne a3, a4, .Lshifted_copy
+
+	/* Aligned path: defer to byte-oriented copy for small sizes */
+	sltiu a5, a2, 128
+	bnez a5, 4f
 
 	beqz a3, 2f  /* Skip if already aligned */
 	/*
@@ -76,6 +77,79 @@ SYM_FUNC_START(__memcpy)
 	addi t6, t6, 16*SZREG
 	bltu a1, a3, 3b
 	andi a2, a2, (16*SZREG)-1  /* Update count */
+	j 4f /* Skip shifted copy section */
+
+.Lshifted_copy:
+	/*
+	 * Source and dest have different alignments.
+	 * a3 = dest & (SZREG-1), a4 = src & (SZREG-1)
+	 * Align destination first, then use shifted word copy.
+	 */
+
+	/* For small sizes, just use byte copy */
+	sltiu a5, a2, 16
+	bnez a5, 4f
+
+	/* If dest is already aligned, skip to shifted loop setup */
+	beqz a3, .Ldest_aligned
+
+	/* Calculate bytes needed to align dest: SZREG - a3 */
+	neg a5, a3
+	addi a5, a5, SZREG
+	sub a2, a2, a5		/* Update count */
+
+.Lalign_dest_loop:
+	lb a4, 0(a1)
+	addi a1, a1, 1
+	sb a4, 0(t6)
+	addi t6, t6, 1
+	addi a5, a5, -1
+	bnez a5, .Lalign_dest_loop
+
+.Ldest_aligned:
+	/*
+	 * Dest is now aligned. Check if we have enough bytes
+	 * remaining for word-oriented copy.
+	 */
+	sltiu a3, a2, SZREG
+	bnez a3, 4f
+
+	/*
+	 * Calculate shift amounts based on source alignment (distance).
+	 * distance = src & (SZREG-1), guaranteed non-zero since we only
+	 * reach here when src and dest had different alignments.
+	 */
+	andi a3, a1, SZREG-1	/* a3 = distance */
+	slli a4, a3, 3		/* a4 = distance * 8 (right shift amount) */
+	li a5, SZREG*8
+	sub a5, a5, a4		/* a5 = SZREG*8 - distance*8 (left shift) */
+
+	/* Align src backwards to word boundary */
+	sub a1, a1, a3
+
+	/* Calculate end address: dest + (count rounded down to words) */
+	andi a6, a2, ~(SZREG-1)
+	add a6, t6, a6		/* a6 = loop end address for dest */
+
+	/* Load first aligned word from source */
+	REG_L t0, 0(a1)
+
+.Lshifted_loop:
+	REG_L t1, SZREG(a1)	/* Load next aligned word */
+	srl t2, t0, a4		/* Shift right: low part from current word */
+	mv t0, t1		/* Current = next for next iteration */
+	addi a1, a1, SZREG
+	addi t6, t6, SZREG
+	sll t3, t0, a5		/* Shift left: high part from next word */
+	or t2, t2, t3		/* Combine to form output word */
+	REG_S t2, -SZREG(t6)	/* Store to aligned dest */
+	bltu t6, a6, .Lshifted_loop
+
+	/* Restore src to correct unaligned position */
+	add a1, a1, a3
+	/* Calculate remaining byte count */
+	andi a2, a2, SZREG-1
+	/* Fall through to label 4 for remaining bytes */
 
 4: /* Handle trailing misalignment */
-- 
2.53.0
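As a worked check of the shift arithmetic in .Lshifted_loop: with SZREG = 8
and a source distance of 3, the right shift is 24 bits and the left shift is
40 bits, so each output word takes the top five bytes of the current aligned
word and the bottom three bytes of the next one. A standalone little-endian
sketch (values and the combine() name are made up for the demo; dist must be
1..7, which the patch guarantees since the shifted path is only entered on a
non-zero distance):

```c
#include <stdint.h>

/* Mirrors the srl/sll/or sequence in .Lshifted_loop for one output word. */
static uint64_t combine(uint64_t cur, uint64_t next, unsigned dist)
{
	unsigned rshift = dist * 8;	/* a4 in the patch */
	unsigned lshift = 64 - rshift;	/* a5 in the patch */

	return (cur >> rshift) | (next << lshift);
}
```

For example, with adjacent aligned words holding bytes 0x00..0x07 and
0x08..0x0f, a distance of 3 must yield the word holding bytes 0x03..0x0a.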