From nobody Thu May 7 17:50:04 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B89D5C433EF for ; Sun, 22 May 2022 20:48:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1351223AbiEVUsX (ORCPT ); Sun, 22 May 2022 16:48:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36042 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1343506AbiEVUsT (ORCPT ); Sun, 22 May 2022 16:48:19 -0400 Received: from angie.orcam.me.uk (angie.orcam.me.uk [78.133.224.34]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id E63F22A268; Sun, 22 May 2022 13:48:17 -0700 (PDT) Received: by angie.orcam.me.uk (Postfix, from userid 500) id ACFBB92009C; Sun, 22 May 2022 22:48:14 +0200 (CEST) Received: from localhost (localhost [127.0.0.1]) by angie.orcam.me.uk (Postfix) with ESMTP id A5EEE92009B; Sun, 22 May 2022 21:48:14 +0100 (BST) Date: Sun, 22 May 2022 21:48:14 +0100 (BST) From: "Maciej W. Rozycki" To: Thomas Bogendoerfer cc: Paul Cercueil , Nathan Chancellor , Tiezhu Yang , linux-mips@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C Message-ID: User-Agent: Alpine 2.21 (DEB 202 2017-01-01) MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Recent commit 198688edbf77 ("MIPS: Fix inline asm input/output type=20 mismatch in checksum.h used with Clang") introduced a code size and=20 performance regression with 64-bit code emitted for `csum_tcpudp_nofold'=20 by GCC, caused by a redundant truncation operation produced due to a=20 data type change made to the variable associated with the inline=20 assembly's output operand. The intent previously expressed here with operands and constraints for=20 optimal code was to have the output operand share a register with one=20 inputs, both of a different integer type each. This is perfectly valid=20 with the MIPS psABI where a register can hold integer data of different=20 types and the assembly code used here makes data stored in the output=20 register match the data type used with the output operand, however it=20 has turned out impossible to express this arrangement in source code=20 such as to satisfy LLVM, apparently due to the compiler's internal=20 limitations. There is nothing peculiar about the inline assembly `csum_tcpudp_nofold'=20 includes however, though it does choose assembly instructions carefully. Rewrite this piece of assembly in plain C then, using corresponding C=20 language operations, making GCC produce the same assembly instructions,=20 possibly shuffled, in the general case and sometimes actually fewer of=20 them where an input is constant, because the compiler does not have to=20 reload it to a register (operand constraints could be adjusted for that,=20 but the plain C approach is cleaner anyway). Example code size changes are as follows, for a 32-bit configuration: text data bss total filename 5920480 1347236 126592 7394308 vmlinux-old 5920480 1347236 126592 7394308 vmlinux-now 5919728 1347236 126592 7393556 vmlinux-c and for a 64-bit configuration: text data bss total filename 6024112 1790828 225728 8040668 vmlinux-old 6024128 1790828 225728 8040684 vmlinux-now 6023760 1790828 225728 8040316 vmlinux-c respectively, where "old" is with the commit referred reverted, "now" is=20 with no change, and "c" is with this change applied. Signed-off-by: Maciej W. Rozycki --- Hi, I have visually inspected code produced and verified this change to boot=20 with TCP networking performing just fine, both with a 32-bit and a 64-bit=20 configuration. Sadly with the little endianness only, because in the=20 course of this verification I have discovered the core card of my Malta=20 board bit the dust a few days ago, apparently in a permanent manner, and I=20 have no other big-endian MIPS system available here to try. The only difference between the two endiannesses is the left-shift=20 operation on (proto + len) however, which doesn't happen for big-endian=20 configurations, so the little endianness should in principle provide=20 enough coverage. Also I'm leaving it to LLVM folks to verify, however this is plain C, so=20 it is expected to just work. Please apply. Maciej --- arch/mips/include/asm/checksum.h | 71 ++++++++++++++++++----------------= ----- 1 file changed, 34 insertions(+), 37 deletions(-) linux-mips-csum-tcpudp-nofold-proper-fix.diff Index: linux-macro/arch/mips/include/asm/checksum.h =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- linux-macro.orig/arch/mips/include/asm/checksum.h +++ linux-macro/arch/mips/include/asm/checksum.h @@ -128,48 +128,45 @@ static inline __sum16 ip_fast_csum(const =20 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len, __u8 proto, - __wsum sum) + __wsum isum) { - unsigned long tmp =3D (__force unsigned long)sum; + const unsigned int sh32 =3D IS_ENABLED(CONFIG_64BIT) ? 32 : 0; + unsigned long sum =3D (__force unsigned long)daddr; + unsigned long tmp; + __u32 osum; =20 - __asm__( - " .set push # csum_tcpudp_nofold\n" - " .set noat \n" -#ifdef CONFIG_32BIT - " addu %0, %2 \n" - " sltu $1, %0, %2 \n" - " addu %0, $1 \n" + tmp =3D (__force unsigned long)saddr; + sum +=3D tmp; =20 - " addu %0, %3 \n" - " sltu $1, %0, %3 \n" - " addu %0, $1 \n" + if (IS_ENABLED(CONFIG_32BIT)) + sum +=3D sum < tmp; =20 - " addu %0, %4 \n" - " sltu $1, %0, %4 \n" - " addu %0, $1 \n" -#endif -#ifdef CONFIG_64BIT - " daddu %0, %2 \n" - " daddu %0, %3 \n" - " daddu %0, %4 \n" - " dsll32 $1, %0, 0 \n" - " daddu %0, $1 \n" - " sltu $1, %0, $1 \n" - " dsra32 %0, %0, 0 \n" - " addu %0, $1 \n" -#endif - " .set pop" - : "=3Dr" (tmp) - : "0" ((__force unsigned long)daddr), - "r" ((__force unsigned long)saddr), -#ifdef __MIPSEL__ - "r" ((proto + len) << 8), -#else - "r" (proto + len), -#endif - "r" ((__force unsigned long)sum)); + /* + * We know PROTO + LEN has the sign bit clear, so cast to a signed + * type to avoid an extraneous zero-extension where TMP is 64-bit. + */ + tmp =3D (__s32)(proto + len); + tmp <<=3D IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? 8 : 0; + sum +=3D tmp; + if (IS_ENABLED(CONFIG_32BIT)) + sum +=3D sum < tmp; =20 - return (__force __wsum)tmp; + tmp =3D (__force unsigned long)isum; + sum +=3D tmp; + + if (IS_ENABLED(CONFIG_32BIT)) { + sum +=3D sum < tmp; + osum =3D sum; + } else if (IS_ENABLED(CONFIG_64BIT)) { + tmp =3D sum << sh32; + sum +=3D tmp; + osum =3D sum < tmp; + osum +=3D sum >> sh32; + } else { + BUILD_BUG(); + } + + return (__force __wsum)osum; } #define csum_tcpudp_nofold csum_tcpudp_nofold