From nobody Sun Sep 14 16:30:56 2025
From: ysionneau@kalrayinc.com
To: linux-kernel@vger.kernel.org
Cc: Jonathan Borne, Julian Vetter, Yann Sionneau, Clement Leger,
 Jules Maselbas, Marius Gligor
Subject: [RFC PATCH v3 29/37] kvx: Add some library functions
Date: Mon, 22 Jul 2024 11:41:40 +0200
Message-ID: <20240722094226.21602-30-ysionneau@kalrayinc.com>
X-Mailer: git-send-email 2.45.2
In-Reply-To: <20240722094226.21602-1-ysionneau@kalrayinc.com>
References: <20240722094226.21602-1-ysionneau@kalrayinc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Yann Sionneau <ysionneau@kalrayinc.com>

Add some library functions for kvx, including: delay, memset, memcpy,
strlen, clear_page, copy_page, raw_copy_from/to_user, asm_clear_user
and libgcc functions.
Co-developed-by: Clement Leger
Signed-off-by: Clement Leger
Co-developed-by: Jules Maselbas
Signed-off-by: Jules Maselbas
Co-developed-by: Julian Vetter
Signed-off-by: Julian Vetter
Co-developed-by: Marius Gligor
Signed-off-by: Marius Gligor
Signed-off-by: Yann Sionneau
---

Notes:
    V1 -> V2: no changes

    V2 -> V3:
     - Add code to remove dependency on libgcc (lib/div.c)
     - simplify memset assembly code: remove special handling of memset '\0'
     - typos
     - use SYM_FUNC_{START/END} instead of ENTRY/ENDPROC

 arch/kvx/include/asm/string.h |  20 ++++
 arch/kvx/lib/clear_page.S     |  40 +++++++
 arch/kvx/lib/copy_page.S      |  90 ++++++++++++++
 arch/kvx/lib/delay.c          |  40 +++++++
 arch/kvx/lib/div.c            | 198 +++++++++++++++++++++++++++++++
 arch/kvx/lib/libgcc.h         |  25 ++++
 arch/kvx/lib/memcpy.c         |  72 ++++++++++++
 arch/kvx/lib/memset.S         | 216 ++++++++++++++++++++++++++++++++++
 arch/kvx/lib/strlen.S         | 122 +++++++++++++++++++
 arch/kvx/lib/usercopy.S       |  90 ++++++++++++++
 10 files changed, 913 insertions(+)
 create mode 100644 arch/kvx/include/asm/string.h
 create mode 100644 arch/kvx/lib/clear_page.S
 create mode 100644 arch/kvx/lib/copy_page.S
 create mode 100644 arch/kvx/lib/delay.c
 create mode 100644 arch/kvx/lib/div.c
 create mode 100644 arch/kvx/lib/libgcc.h
 create mode 100644 arch/kvx/lib/memcpy.c
 create mode 100644 arch/kvx/lib/memset.S
 create mode 100644 arch/kvx/lib/strlen.S
 create mode 100644 arch/kvx/lib/usercopy.S

diff --git a/arch/kvx/include/asm/string.h b/arch/kvx/include/asm/string.h
new file mode 100644
index 0000000000000..677c1393a5cdb
--- /dev/null
+++ b/arch/kvx/include/asm/string.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Jules Maselbas
+ */
+
+#ifndef _ASM_KVX_STRING_H
+#define _ASM_KVX_STRING_H
+
+#define __HAVE_ARCH_MEMSET
+extern void *memset(void *s, int c, size_t n);
+
+#define __HAVE_ARCH_MEMCPY
+extern void *memcpy(void *dest, const void *src, size_t n);
+
+#define __HAVE_ARCH_STRLEN
+extern size_t strlen(const char *s);
+
+#endif	/* _ASM_KVX_STRING_H */

diff --git a/arch/kvx/lib/clear_page.S b/arch/kvx/lib/clear_page.S
new file mode 100644
index 0000000000000..8c97b2ae84bb3
--- /dev/null
+++ b/arch/kvx/lib/clear_page.S
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Marius Gligor
+ *            Clement Leger
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+#define CLEAR_PAGE_LOOP_COUNT	(PAGE_SIZE / 32)
+
+/*
+ * Clear page @dest.
+ *
+ * Parameters:
+ *	r0 - dest page
+ */
+SYM_FUNC_START(clear_page)
+	make $r1 = CLEAR_PAGE_LOOP_COUNT
+	;;
+	make $r4 = 0
+	make $r5 = 0
+	make $r6 = 0
+	make $r7 = 0
+	;;
+
+	loopdo $r1, clear_page_done
+		;;
+		so 0[$r0] = $r4r5r6r7
+		addd $r0 = $r0, 32
+		;;
+	clear_page_done:
+	ret
+	;;
+SYM_FUNC_END(clear_page)
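For readers less familiar with kvx hardware loops, the loopdo above maps to
a plain C loop storing one 32-byte octuple per iteration; a minimal sketch
(hypothetical helper name, PAGE_SIZE stand-in assumed, expressed as two
16-byte stores per iteration):

	#include <stdint.h>

	#define PAGE_SIZE_SKETCH 4096	/* illustrative stand-in for PAGE_SIZE */

	/* Rough C equivalent of clear_page above. */
	static void clear_page_sketch(void *dest)
	{
		__uint128_t *p = dest;
		unsigned long i;

		for (i = 0; i < PAGE_SIZE_SKETCH / 32; i++) {
			*p++ = 0;	/* first half of the octuple */
			*p++ = 0;	/* second half of the octuple */
		}
	}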
diff --git a/arch/kvx/lib/copy_page.S b/arch/kvx/lib/copy_page.S
new file mode 100644
index 0000000000000..8bffc56ef9752
--- /dev/null
+++ b/arch/kvx/lib/copy_page.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include
+#include
+
+#include
+
+/* We have 8 load/store octuple (32 bytes) per hardware loop */
+#define COPY_SIZE_PER_LOOP	(32 * 8)
+#define COPY_PAGE_LOOP_COUNT	(PAGE_SIZE / COPY_SIZE_PER_LOOP)
+
+/*
+ * Copy a page from src to dest (both are page aligned)
+ * In order to hide smem latency, unroll the loop to trigger multiple
+ * in-flight loads and avoid waiting too long for them to return.
+ * We use 8 * 32 byte loads even though we could use more (up to 10 loads)
+ * to simplify the handling using a single hardware loop.
+ *
+ * Parameters:
+ *	r0 - dest
+ *	r1 - src
+ */
SYM_FUNC_START(copy_page)
+	make $r2 = COPY_PAGE_LOOP_COUNT
+	make $r3 = 0
+	;;
+	loopdo $r2, copy_page_done
+		;;
+		/*
+		 * Load 8 * 32 bytes using uncached access to avoid hitting
+		 * the cache
+		 */
+		lo.xs $r32r33r34r35 = $r3[$r1]
+		/* Copy current copy index for store */
+		copyd $r2 = $r3
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r36r37r38r39 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r40r41r42r43 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r44r45r46r47 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r48r49r50r51 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r52r53r54r55 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r56r57r58r59 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		lo.xs $r60r61r62r63 = $r3[$r1]
+		addd $r3 = $r3, 1
+		;;
+		/* And then store all of them */
+		so.xs $r2[$r0] = $r32r33r34r35
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r36r37r38r39
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r40r41r42r43
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r44r45r46r47
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r48r49r50r51
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r52r53r54r55
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r56r57r58r59
+		addd $r2 = $r2, 1
+		;;
+		so.xs $r2[$r0] = $r60r61r62r63
+		;;
+	copy_page_done:
+	ret
+	;;
+SYM_FUNC_END(copy_page)

diff --git a/arch/kvx/lib/delay.c b/arch/kvx/lib/delay.c
new file mode 100644
index 0000000000000..96e59114b47dd
--- /dev/null
+++ b/arch/kvx/lib/delay.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+
+#include
+#include
+
+#include
+#include
+
+void __delay(unsigned long loops)
+{
+	cycles_t target_cycle = get_cycles() + loops;
+
+	while (get_cycles() < target_cycle)
+		;
+}
+EXPORT_SYMBOL(__delay);
+
+inline void __const_udelay(unsigned long xloops)
+{
+	u64 loops = (u64)xloops * (u64)loops_per_jiffy * HZ;
+
+	__delay(loops >> 32);
+}
+EXPORT_SYMBOL(__const_udelay);
+
+void __udelay(unsigned long usecs)
+{
+	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+}
+EXPORT_SYMBOL(__udelay);
+
+void __ndelay(unsigned long nsecs)
+{
+	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+}
+EXPORT_SYMBOL(__ndelay);
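The 0x10C7 and 0x5 constants are 2^32-scaled fixed-point factors,
ceil(2^32 / 10^6) and ceil(2^32 / 10^9), so the final >> 32 in
__const_udelay cancels the scaling. A worked example with assumed values
(loops_per_jiffy = 1000000 and HZ = 100, i.e. 100 million delay loops per
second; standalone userspace illustration, not patch code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t usecs = 10;
		uint64_t xloops = usecs * 0x10C7;	/* usecs * ceil(2^32 / 10^6) */
		uint64_t lpj = 1000000, hz = 100;	/* assumed for illustration */

		/* __const_udelay computes (xloops * lpj * hz) >> 32 */
		printf("%llu\n", (unsigned long long)((xloops * lpj * hz) >> 32));
		return 0;
	}

This prints 1000, i.e. ten microseconds at 100 million loops per second.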
diff --git a/arch/kvx/lib/div.c b/arch/kvx/lib/div.c
new file mode 100644
index 0000000000000..1e107732e1a23
--- /dev/null
+++ b/arch/kvx/lib/div.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Benoit Dinechin
+ */
+#include
+#include "libgcc.h"
+
+static inline uint64x4_t uint64x2_divmod(uint64x2_t a, uint64x2_t b)
+{
+	float64x2_t double1 = 1.0 - (float64x2_t){};
+	int64x2_t bbig = (int64x2_t)b < 0;
+	int64x2_t bin01 = (uint64x2_t)b <= 1;
+	int64x2_t special = bbig | bin01;
+	// uint64x2_t q = bbig ? a >= b : a;
+	uint64x2_t q = __builtin_kvx_selectdp(-(a >= b), a, bbig, ".nez");
+	// uint64x2_t r = bbig ? a - (b&-q) : 0;
+	uint64x2_t r = __builtin_kvx_selectdp(a - (b & -q), 0 - (uint64x2_t){}, bbig, ".nez");
+	float64x2_t doublea = __builtin_kvx_floatudp(a, 0, ".rn.s");
+	float64x2_t doubleb = __builtin_kvx_floatudp(b, 0, ".rn.s");
+	float floatb_0 = __builtin_kvx_fnarrowdw(doubleb[0], ".rn.s");
+	float floatb_1 = __builtin_kvx_fnarrowdw(doubleb[1], ".rn.s");
+	float floatrec_0 = __builtin_kvx_frecw(floatb_0, ".rn.s");
+	float floatrec_1 = __builtin_kvx_frecw(floatb_1, ".rn.s");
+
+	if (__builtin_kvx_anydp(b, ".eqz"))
+		goto div0;
+
+	float64x2_t doublerec = {__builtin_kvx_fwidenwd(floatrec_0, ".s"),
+				 __builtin_kvx_fwidenwd(floatrec_1, ".s")};
+	float64x2_t doubleq0 = __builtin_kvx_fmuldp(doublea, doublerec, ".rn.s");
+	uint64x2_t q0 = __builtin_kvx_fixedudp(doubleq0, 0, ".rn.s");
+	int64x2_t a1 = (int64x2_t)(a - q0 * b);
+	float64x2_t alpha = __builtin_kvx_ffmsdp(doubleb, doublerec, double1, ".rn.s");
+	float64x2_t beta = __builtin_kvx_ffmadp(alpha, doublerec, doublerec, ".rn.s");
+	float64x2_t doublea1 = __builtin_kvx_floatdp(a1, 0, ".rn.s");
+	float64x2_t gamma = __builtin_kvx_fmuldp(beta, doublea1, ".rn.s");
+	int64x2_t q1 = __builtin_kvx_fixeddp(gamma, 0, ".rn.s");
+	int64x2_t rem = a1 - q1 * b;
+	uint64x2_t quo = q0 + q1;
+	uint64x2_t cond = (uint64x2_t)(rem >> 63);
+
+	// q = !special ? quo + cond : q;
+	q = __builtin_kvx_selectdp(quo + cond, q, special, ".eqz");
+	// r = !special ? rem + (b & cond) : r;
+	r = __builtin_kvx_selectdp(rem + (b & cond), r, special, ".eqz");
+	return __builtin_kvx_cat256(q, r);
+
+div0:
+	__builtin_trap();
+}
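A worked derivation of the refinement above (not in the patch itself, but
why it converges): frecw returns a single-precision estimate rec ~ 1/b.
Then

	alpha = 1 - b*rec		(the ffms)
	beta  = alpha*rec + rec		(the ffma)
	      = rec*(2 - b*rec)

which is exactly one Newton-Raphson iteration for the reciprocal of b and
roughly doubles the number of correct bits. q0 = round(a * rec_widened) is
a first quotient guess, a1 = a - q0*b is its signed error (exact modulo
2^64), and q1 = round(beta * a1) corrects it, giving quo = q0 + q1.
Finally, cond = rem >> 63 is all-ones when the corrected remainder went
negative, so quo + cond subtracts 1 from the quotient and
r = rem + (b & cond) adds b back to the remainder.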
+static inline uint32x4_t uint32x2_divmod(uint32x2_t a, uint32x2_t b)
+{
+	int i;
+
+	uint64x2_t acc = __builtin_kvx_widenwdp(a, ".z");
+	uint64x2_t src = __builtin_kvx_widenwdp(b, ".z") << (32 - 1);
+	uint64x2_t wb = __builtin_kvx_widenwdp(b, ".z");
+	uint32x2_t q, r;
+
+	if (__builtin_kvx_anywp(b, ".eqz"))
+		goto div0;
+	// As `src == b << (32 - 1)`, adding src yields `src == b << 32`.
+	src += src & (wb > acc);
+
+	for (i = 0; i < 32; i++)
+		acc = __builtin_kvx_stsudp(src, acc);
+
+	q = __builtin_kvx_narrowdwp(acc, "");
+	r = __builtin_kvx_narrowdwp(acc >> 32, "");
+	return __builtin_kvx_cat128(q, r);
+div0:
+	__builtin_trap();
+}
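The stsudp loop is a classic 32-step shift-and-subtract divider run on two
lanes at once, with the remainder accumulating in the high word of acc and
the quotient bits in the low word. A scalar plain-C sketch of the same
scheme (hypothetical helper; it does not model the exact stsud step
semantics):

	#include <stdint.h>

	/* Returns remainder in the high 32 bits and quotient in the low
	 * 32 bits, mirroring how acc packs both halves above.
	 * Requires b != 0 (the real code traps on b == 0 beforehand). */
	static uint64_t divmod32_sketch(uint32_t a, uint32_t b)
	{
		uint64_t r = 0;
		uint32_t q = 0;
		int i;

		for (i = 31; i >= 0; i--) {
			r = (r << 1) | ((a >> i) & 1);	/* bring down next bit */
			q <<= 1;
			if (r >= b) {		/* subtract divisor if it fits */
				r -= b;
				q |= 1;
			}
		}
		return (r << 32) | q;
	}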
+
+int32x2_t __divv2si3(int32x2_t a, int32x2_t b)
+{
+	uint32x2_t absa = __builtin_kvx_abswp(a, "");
+	uint32x2_t absb = __builtin_kvx_abswp(b, "");
+	uint32x4_t divmod = uint32x2_divmod(absa, absb);
+	int32x2_t result = __builtin_kvx_low64(divmod);
+
+	return __builtin_kvx_selectwp(-result, result, a ^ b, ".ltz");
+}
+
+uint64x2_t __udivv2di3(uint64x2_t a, uint64x2_t b)
+{
+	uint64x4_t divmod = uint64x2_divmod(a, b);
+
+	return __builtin_kvx_low128(divmod);
+}
+
+uint64x2_t __umodv2di3(uint64x2_t a, uint64x2_t b)
+{
+	uint64x4_t divmod = uint64x2_divmod(a, b);
+
+	return __builtin_kvx_high128(divmod);
+}
+
+int64x2_t __modv2di3(int64x2_t a, int64x2_t b)
+{
+	uint64x2_t absa = __builtin_kvx_absdp(a, "");
+	uint64x2_t absb = __builtin_kvx_absdp(b, "");
+	uint64x4_t divmod = uint64x2_divmod(absa, absb);
+	int64x2_t result = __builtin_kvx_high128(divmod);
+
+	return __builtin_kvx_selectdp(-result, result, a, ".ltz");
+}
+
+uint64_t __udivdi3(uint64_t a, uint64_t b)
+{
+	uint64x2_t udivv2di3 = __udivv2di3(a - (uint64x2_t){}, b - (uint64x2_t){});
+
+	return (uint64_t)udivv2di3[1];
+}
+
+static inline uint64x2_t uint64_divmod(uint64_t a, uint64_t b)
+{
+	double double1 = 1.0;
+	int64_t bbig = (int64_t)b < 0;
+	int64_t bin01 = (uint64_t)b <= 1;
+	int64_t special = bbig | bin01;
+	// uint64_t q = bbig ? a >= b : a;
+	uint64_t q = __builtin_kvx_selectd(a >= b, a, bbig, ".dnez");
+	// uint64_t r = bbig ? a - (b&-q) : 0;
+	uint64_t r = __builtin_kvx_selectd(a - (b & -q), 0, bbig, ".dnez");
+	double doublea = __builtin_kvx_floatud(a, 0, ".rn.s");
+	double doubleb = __builtin_kvx_floatud(b, 0, ".rn.s");
+	float floatb = __builtin_kvx_fnarrowdw(doubleb, ".rn.s");
+	float floatrec = __builtin_kvx_frecw(floatb, ".rn.s");
+
+	if (b == 0)
+		goto div0;
+
+	double doublerec = __builtin_kvx_fwidenwd(floatrec, ".s");
+	double doubleq0 = __builtin_kvx_fmuld(doublea, doublerec, ".rn.s");
+	uint64_t q0 = __builtin_kvx_fixedud(doubleq0, 0, ".rn.s");
+	int64_t a1 = a - q0 * b;
+	double alpha = __builtin_kvx_ffmsd(doubleb, doublerec, double1, ".rn.s");
+	double beta = __builtin_kvx_ffmad(alpha, doublerec, doublerec, ".rn.s");
+	double doublea1 = __builtin_kvx_floatd(a1, 0, ".rn.s");
+	double gamma = __builtin_kvx_fmuld(beta, doublea1, ".rn.s");
+	int64_t q1 = __builtin_kvx_fixedd(gamma, 0, ".rn.s");
+	int64_t rem = a1 - q1 * b;
+	uint64_t quo = q0 + q1;
+	uint64_t cond = rem >> 63;
+
+	// q = !special ? quo + cond : q;
+	q = __builtin_kvx_selectd(quo + cond, q, special, ".deqz");
+	// r = !special ? rem + (b & cond) : r;
+	r = __builtin_kvx_selectd(rem + (b & cond), r, special, ".deqz");
+
+	return (uint64x2_t){q, r};
+
+div0:
+	__builtin_trap();
+}
+
+int64_t __divdi3(int64_t a, int64_t b)
+{
+	uint64_t absa = __builtin_kvx_absd(a, "");
+	uint64_t absb = __builtin_kvx_absd(b, "");
+	uint64x2_t divmod = uint64_divmod(absa, absb);
+
+	if ((a ^ b) < 0)
+		divmod[0] = -divmod[0];
+
+	return divmod[0];
+}
+
+uint64_t __umoddi3(uint64_t a, uint64_t b)
+{
+	uint64x2_t umodv2di3 = __umodv2di3(a - (uint64x2_t){}, b - (uint64x2_t){});
+
+	return (uint64_t)umodv2di3[1];
+}
+
+int64_t __moddi3(int64_t a, int64_t b)
+{
+	int64x2_t modv2di3 = __modv2di3(a - (int64x2_t){}, b - (int64x2_t){});
+
+	return (int64_t)modv2di3[1];
+}
+
+int64x2_t __divv2di3(int64x2_t a, int64x2_t b)
+{
+	uint64x2_t absa = __builtin_kvx_absdp(a, "");
+	uint64x2_t absb = __builtin_kvx_absdp(b, "");
+	uint64x4_t divmod = uint64x2_divmod(absa, absb);
+	int64x2_t result = __builtin_kvx_low128(divmod);
+
+	return __builtin_kvx_selectdp(-result, result, a ^ b, ".ltz");
+}

diff --git a/arch/kvx/lib/libgcc.h b/arch/kvx/lib/libgcc.h
new file mode 100644
index 0000000000000..cbbe33762ecc4
--- /dev/null
+++ b/arch/kvx/lib/libgcc.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Benoit Dinechin
+ */
+
+typedef uint32_t uint32x2_t __attribute((vector_size(2 * sizeof(uint32_t))));
+typedef uint32_t uint32x4_t __attribute((vector_size(4 * sizeof(uint32_t))));
+typedef int32_t int32x2_t __attribute((vector_size(2 * sizeof(int32_t))));
+typedef int64_t int64x2_t __attribute((vector_size(2 * sizeof(int64_t))));
+typedef uint64_t uint64x2_t __attribute((vector_size(2 * sizeof(uint64_t))));
+typedef uint64_t uint64x4_t __attribute((vector_size(4 * sizeof(uint64_t))));
+
+typedef double float64_t;
+typedef float64_t float64x2_t __attribute((vector_size(2 * sizeof(float64_t))));
+
+int32x2_t __divv2si3(int32x2_t a, int32x2_t b);
+uint64x2_t __udivv2di3(uint64x2_t a, uint64x2_t b);
+uint64x2_t __umodv2di3(uint64x2_t a, uint64x2_t b);
+int64x2_t __modv2di3(int64x2_t a, int64x2_t b);
+uint64_t __udivdi3(uint64_t a, uint64_t b);
+int64_t __divdi3(int64_t a, int64_t b);
+uint64_t __umoddi3(uint64_t a, uint64_t b);
+int64_t __moddi3(int64_t a, int64_t b);
+int64x2_t __divv2di3(int64x2_t a, int64x2_t b);
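These prototypes follow the libgcc ABI: on a target without a hardware
integer divider (which the software routines above suggest is the case for
kvx), the compiler lowers C's '/' and '%' on 64-bit and vector operands
into calls to these helpers, so the kernel provides them itself instead of
linking libgcc. An illustrative (hypothetical) caller:

	/* A plain 64-bit division in kernel C ... */
	int64_t ns_to_us(int64_t ns)
	{
		return ns / 1000;	/* ... is emitted as a __divdi3 call */
	}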
diff --git a/arch/kvx/lib/memcpy.c b/arch/kvx/lib/memcpy.c
new file mode 100644
index 0000000000000..17343537a4346
--- /dev/null
+++ b/arch/kvx/lib/memcpy.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Yann Sionneau
+ */
+
+#include
+#include
+
+#include
+
+void *memcpy(void *dest, const void *src, size_t n)
+{
+	__uint128_t *tmp128_d = dest;
+	const __uint128_t *tmp128_s = src;
+	uint64_t *tmp64_d;
+	const uint64_t *tmp64_s;
+	uint32_t *tmp32_d;
+	const uint32_t *tmp32_s;
+	uint16_t *tmp16_d;
+	const uint16_t *tmp16_s;
+	uint8_t *tmp8_d;
+	const uint8_t *tmp8_s;
+
+	/* Copy with the widest access possible, stepping down from
+	 * 16-byte to single-byte accesses as the remainder shrinks */
+	while (n >= 16) {
+		*tmp128_d = *tmp128_s;
+		tmp128_d++;
+		tmp128_s++;
+		n -= 16;
+	}
+
+	tmp64_d = (uint64_t *) tmp128_d;
+	tmp64_s = (uint64_t *) tmp128_s;
+	while (n >= 8) {
+		*tmp64_d = *tmp64_s;
+		tmp64_d++;
+		tmp64_s++;
+		n -= 8;
+	}
+
+	tmp32_d = (uint32_t *) tmp64_d;
+	tmp32_s = (uint32_t *) tmp64_s;
+	while (n >= 4) {
+		*tmp32_d = *tmp32_s;
+		tmp32_d++;
+		tmp32_s++;
+		n -= 4;
+	}
+
+	tmp16_d = (uint16_t *) tmp32_d;
+	tmp16_s = (uint16_t *) tmp32_s;
+	while (n >= 2) {
+		*tmp16_d = *tmp16_s;
+		tmp16_d++;
+		tmp16_s++;
+		n -= 2;
+	}
+
+	tmp8_d = (uint8_t *) tmp16_d;
+	tmp8_s = (uint8_t *) tmp16_s;
+	while (n >= 1) {
+		*tmp8_d = *tmp8_s;
+		tmp8_d++;
+		tmp8_s++;
+		n--;
+	}
+
+	return dest;
+}
+EXPORT_SYMBOL(memcpy);
diff --git a/arch/kvx/lib/memset.S b/arch/kvx/lib/memset.S
new file mode 100644
index 0000000000000..bd2eec89ae1a3
--- /dev/null
+++ b/arch/kvx/lib/memset.S
@@ -0,0 +1,216 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ *            Marius Gligor
+ *            Jules Maselbas
+ */
+
+#include
+
+#include
+
+#define REPLICATE_BYTE_MASK	0x0101010101010101
+#define MIN_SIZE_FOR_ALIGN	128
+
+/*
+ * Optimized memset for kvx architecture
+ *
+ * In order to optimize memset on kvx, we can use various things:
+ * - conditional stores which avoid branch penalties
+ * - store half/word/double/quad/octuple to store up to 32 bytes at a time
+ * - hardware loops for the steady-state part.
+ *
+ * First, we check if the size is below a minimum size. If so, we skip the
+ * alignment part. The kvx architecture supports misalignment, and the
+ * penalty for unaligned accesses is lower than the cost of realigning,
+ * so for small sizes we don't even bother to realign.
+ * The sbmm8 instruction is used to replicate the pattern on all bytes of
+ * a register in one call.
+ * Once alignment has been reached, we can use the hardware loop in order
+ * to optimize throughput. Care must be taken to align hardware loops on
+ * at least 8 bytes for better performance.
+ * Once the main loop is done, we finish the copy by checking the
+ * remaining length to do the necessary stores.
+ *
+ * Pseudo code:
+ *
+ * int memset(void *dest, char pattern, long length)
+ * {
+ *	long dest_align = -((long) dest);
+ *	long copy;
+ *	long orig_dest = dest;
+ *
+ *	uint64_t pattern = sbmm8(pattern, 0x0101010101010101);
+ *	uint128_t pattern128 = pattern << 64 | pattern;
+ *	uint256_t pattern256 = pattern128 << 128 | pattern128;
+ *
+ *	// Keep only low bits
+ *	dest_align &= 0x1F;
+ *	length -= dest_align;
+ *
+ *	// Byte align
+ *	copy = dest_align & (1 << 0);
+ *	if (copy)
+ *		*((u8 *) dest) = pattern;
+ *	dest += copy;
+ *	// Half align
+ *	copy = dest_align & (1 << 1);
+ *	if (copy)
+ *		*((u16 *) dest) = pattern;
+ *	dest += copy;
+ *	// Word align
+ *	copy = dest_align & (1 << 2);
+ *	if (copy)
+ *		*((u32 *) dest) = pattern;
+ *	dest += copy;
+ *	// Double align
+ *	copy = dest_align & (1 << 3);
+ *	if (copy)
+ *		*((u64 *) dest) = pattern;
+ *	dest += copy;
+ *	// Quad align
+ *	copy = dest_align & (1 << 4);
+ *	if (copy)
+ *		*((u128 *) dest) = pattern128;
+ *	dest += copy;
+ *
+ *	// We are now aligned on 256 bits
+ *	loop_octuple_count = length >> 5;
+ *	for (i = 0; i < loop_octuple_count; i++) {
+ *		*((u256 *) dest) = pattern256;
+ *		dest += 32;
+ *	}
+ *
+ *	if (length == 0)
+ *		return orig_dest;
+ *
+ *	// Copy remaining part
+ *	remain = length & (1 << 4);
+ *	if (remain)
+ *		*((u128 *) dest) = pattern128;
+ *	dest += remain;
+ *	remain = length & (1 << 3);
+ *	if (remain)
+ *		*((u64 *) dest) = pattern;
+ *	dest += remain;
+ *	remain = length & (1 << 2);
+ *	if (remain)
+ *		*((u32 *) dest) = pattern;
+ *	dest += remain;
+ *	remain = length & (1 << 1);
+ *	if (remain)
+ *		*((u16 *) dest) = pattern;
+ *	dest += remain;
+ *	remain = length & (1 << 0);
+ *	if (remain)
+ *		*((u8 *) dest) = pattern;
+ *	dest += remain;
+ *
+ *	return orig_dest;
+ * }
+ */
+
+.text
+.align 16
+SYM_FUNC_START(memset)
+	/* Preserve return value */
+	copyd $r3 = $r0
+	/* Replicate the first pattern byte on all bytes */
+	sbmm8 $r32 = $r1, REPLICATE_BYTE_MASK
+	/* Check if length < MIN_SIZE_FOR_ALIGN */
+	compd.geu $r7 = $r2, MIN_SIZE_FOR_ALIGN
+	/* Invert address to compute size to copy to be aligned on 32 bytes */
+	negd $r5 = $r0
+	;;
+	/* Copy second part of pattern for sq */
+	copyd $r33 = $r32
+	/* Compute the size that will be copied to align on 32 bytes boundary */
+	andw $r5 = $r5, 0x1F
+	;;
+	/*
+	 * If size < MIN_SIZE_FOR_ALIGN bytes, directly go to so, it will be
+	 * done unaligned but that is still better than what we can do with sb
+	 */
+	cb.deqz $r7? .Laligned_32
+	;;
+	/* If we are already aligned on 32 bytes, jump to main "so" loop */
+	cb.deqz $r5? .Laligned_32
+	/* Remove unaligned part from length */
+	sbfd $r2 = $r5, $r2
+	/* Check if we need to copy 1 byte */
+	andw $r4 = $r5, (1 << 0)
+	;;
+	/* If we are not aligned, store byte */
+	sb.dnez $r4? [$r0] = $r32
+	addd $r0 = $r0, $r4
+	/* Check if we need to copy 2 bytes */
+	andw $r4 = $r5, (1 << 1)
+	;;
+	sh.dnez $r4? [$r0] = $r32
+	addd $r0 = $r0, $r4
+	/* Check if we need to copy 4 bytes */
+	andw $r4 = $r5, (1 << 2)
+	;;
+	sw.dnez $r4? [$r0] = $r32
+	addd $r0 = $r0, $r4
+	/* Check if we need to copy 8 bytes */
+	andw $r4 = $r5, (1 << 3)
+	;;
+	sd.dnez $r4? [$r0] = $r32
+	addd $r0 = $r0, $r4
+	/* Check if we need to copy 16 bytes */
+	andw $r4 = $r5, (1 << 4)
+	;;
+	sq.dnez $r4? [$r0] = $r32r33
+	addd $r0 = $r0, $r4
+	;;
+.Laligned_32:
+	/* Prepare amount of data for 32 bytes store */
+	srld $r10 = $r2, 5
+	;;
+	copyq $r34r35 = $r32, $r33
+	/* Remaining bytes for 16 bytes store */
+	andw $r8 = $r2, (1 << 4)
+	make $r11 = 32
+	/* Check if there is enough data for a 32 bytes store */
+	cb.deqz $r10? .Laligned_32_done
+	;;
+	loopdo $r10, .Laligned_32_done
+		;;
+		so 0[$r0] = $r32r33r34r35
+		addd $r0 = $r0, $r11
+		;;
+.Laligned_32_done:
+	/*
+	 * Now that we have handled all the aligned bytes using 'so', we
+	 * handle the remainder using decreasing-size stores. We also exploit
+	 * the fact that we are aligned to simply check the remaining size.
+	 */
+	sq.dnez $r8? [$r0] = $r32r33
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 8 bytes store */
+	andw $r8 = $r2, (1 << 3)
+	cb.deqz $r2? .Lmemset_done
+	;;
+	sd.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 4 bytes store */
+	andw $r8 = $r2, (1 << 2)
+	;;
+	sw.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	/* Remaining bytes for 2 bytes store */
+	andw $r8 = $r2, (1 << 1)
+	;;
+	sh.dnez $r8? [$r0] = $r32
+	addd $r0 = $r0, $r8
+	;;
+	sb.odd $r2? [$r0] = $r32
+	;;
+.Lmemset_done:
+	/* Restore original value */
+	copyd $r0 = $r3
+	ret
+	;;
+SYM_FUNC_END(memset)
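Aside: the sbmm8 replication with REPLICATE_BYTE_MASK has the same effect
as multiplying the byte by 0x0101010101010101; a quick C illustration
(hypothetical helper, not patch code):

	#include <stdint.h>

	/* Replicate one byte across a 64-bit word, as sbmm8 does above:
	 * 0xAB -> 0xABABABABABABABAB. */
	static inline uint64_t replicate_byte(uint8_t c)
	{
		return (uint64_t)c * 0x0101010101010101ULL;
	}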
diff --git a/arch/kvx/lib/strlen.S b/arch/kvx/lib/strlen.S
new file mode 100644
index 0000000000000..c4bffb56cfa64
--- /dev/null
+++ b/arch/kvx/lib/strlen.S
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Jules Maselbas
+ */
+#include
+#include
+
+/*
+ * kvx optimized strlen
+ *
+ * This implementation of strlen only does aligned memory accesses.
+ * Since we don't know the total length, the idea is to load double
+ * words and stop on the first null byte found. It is always safe to
+ * read beyond the end of the string up to the next 8-byte boundary.
+ *
+ * This implementation of strlen uses a trick to detect if a double
+ * word contains a null byte [1]:
+ *
+ * > #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
+ * > The sub-expression (v - 0x01010101UL), evaluates to a high bit set
+ * > in any byte whenever the corresponding byte in v is zero or greater
+ * > than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high
+ * > bits set in bytes where the byte of v doesn't have its high bit set
+ * > (so the byte was less than 0x80). Finally, by ANDing these two sub-
+ * > expressions the result is the high bits set where the bytes in v
+ * > were zero, since the high bits set due to a value greater than 0x80
+ * > in the first sub-expression are masked off by the second.
+ *
+ * [1] http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+ *
+ * A second trick is used to get the exact number of characters before
+ * the first null byte in a double word:
+ *
+ *   clz(sbmmt(zero, 0x0102040810204080))
+ *
+ * This trick uses the haszero result, which maps a null byte to 0x80
+ * and other values to 0x00. The idea is to count the number of
+ * consecutive null bytes in the double word (counting from the least
+ * significant byte to the most significant byte). To do so, the bit
+ * matrix transpose "packs" all high bits (0x80) into the most
+ * significant byte (MSB). It is not possible to count the trailing
+ * zeros in this MSB; however, if a byte swap is done before the bit
+ * matrix transpose, we still have all the information in the MSB but
+ * can now count the leading zeros.
+ * The sbmmt instruction with the matrix 0x0102040810204080 does exactly
+ * what we need: a byte swap followed by a bit matrix transpose.
+ *
+ * A last trick is used to handle the first double word misalignment.
+ * This is done by masking off the N lower bytes (excess read) with N
+ * between 0 and 7. The mask is applied on the haszero result and
+ * forces the N lower bytes to be considered not null.
+ *
+ * This is a C implementation of the algorithm described above:
+ *
+ * size_t strlen(const char *s) {
+ *	uint64_t *p = (uint64_t *)((uintptr_t)s & ~0x7);
+ *	uint64_t rem = ((uintptr_t)s) % 8;
+ *	uint64_t low = -0x0101010101010101;
+ *	uint64_t high = 0x8080808080808080;
+ *	uint64_t dword, zero;
+ *	uint64_t msk, len;
+ *
+ *	dword = *p++;
+ *	zero = (dword + low) & ~dword & high;
+ *	msk = 0xffffffffffffffff << (rem * 8);
+ *	zero &= msk;
+ *
+ *	while (!zero) {
+ *		dword = *p++;
+ *		zero = (dword + low) & ~dword & high;
+ *	}
+ *
+ *	zero = __builtin_kvx_sbmmt8(zero, 0x0102040810204080);
+ *	len = ((void *)p - (void *)s) - 8;
+ *	len += __builtin_kvx_clzd(zero);
+ *
+ *	return len;
+ * }
+ */
+
+.text
+.align 16
+SYM_FUNC_START(strlen)
+	andd $r1 = $r0, ~0x7
+	andd $r2 = $r0, 0x7
+	make $r10 = -0x0101010101010101
+	make $r11 = 0x8080808080808080
+	;;
+	ld $r4 = 0[$r1]
+	sllw $r2 = $r2, 3
+	make $r3 = 0xffffffffffffffff
+	;;
+	slld $r2 = $r3, $r2
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	;;
+	andd $r6 = $r6, $r2
+	make $r3 = 0
+	;;
+.loop:
+	andd $r4 = $r5, $r6
+	addd $r1 = $r1, 0x8
+	;;
+	cb.dnez $r4? .end
+	ld.deqz $r4? $r4 = [$r1]
+	;;
+	addd $r5 = $r4, $r10
+	andnd $r6 = $r4, $r11
+	goto .loop
+	;;
+.end:
+	addd $r1 = $r1, -0x8
+	sbmmt8 $r4 = $r4, 0x0102040810204080
+	;;
+	clzd $r4 = $r4
+	sbfd $r1 = $r0, $r1
+	;;
+	addd $r0 = $r4, $r1
+	ret
+	;;
+SYM_FUNC_END(strlen)
+EXPORT_SYMBOL(strlen)
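To see the zero-detection trick in action, here is a small standalone demo
with the 64-bit constants used above (hypothetical helper names,
illustration only):

	#include <stdint.h>
	#include <stdio.h>

	/* Maps each zero byte of v to 0x80 and every other byte to 0x00. */
	static uint64_t haszero64(uint64_t v)
	{
		return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
	}

	int main(void)
	{
		/* "ab" plus its NUL loaded little-endian: terminator is byte 2 */
		uint64_t dword = 0x0000000000006261ULL;

		printf("%016llx\n", (unsigned long long)haszero64(dword));
		/* Prints 8080808080800000: the lowest 0x80 sits at byte
		 * index 2, i.e. strlen("ab") == 2. */
		return 0;
	}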
diff --git a/arch/kvx/lib/usercopy.S b/arch/kvx/lib/usercopy.S
new file mode 100644
index 0000000000000..950df1e88479d
--- /dev/null
+++ b/arch/kvx/lib/usercopy.S
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2017-2023 Kalray Inc.
+ * Author(s): Clement Leger
+ */
+#include
+
+/**
+ * Copy from/to a user buffer
+ * r0 = to buffer
+ * r1 = from buffer
+ * r2 = size to copy
+ * These functions can trap when hitting a non-mapped page. In that case,
+ * a NOMAPPING trap is triggered and the trap handler checks whether the
+ * instruction pointer is inside __ex_table. The next steps are described
+ * with the exception table below.
+ */
+.text
+SYM_FUNC_START(raw_copy_from_user)
+SYM_FUNC_START(raw_copy_to_user)
+	/**
+	 * Naive byte-per-byte implementation
+	 */
+	make $r33 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r2? copy_exit
+	;;
+	loopdo $r2, copy_exit
+		;;
+0:		lbz $r34 = $r33[$r1]
+		;;
+1:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r2 = $r2, -1 /* Remaining bytes to copy */
+		;;
+	copy_exit:
+	copyd $r0 = $r2
+	ret
+	;;
+SYM_FUNC_END(raw_copy_to_user)
+SYM_FUNC_END(raw_copy_from_user)
+
+/**
+ * Exception table
+ * Each entry corresponds to the following:
+ * .dword trapping_addr, restore_addr
+ *
+ * On trap, the handler will try to locate a trapping address matching
+ * $spc in the exception table. If one is found, the restore address is
+ * put in the return address of the trap handler, allowing the copy to
+ * finish properly and return the number of bytes not copied/cleared.
+ */
+.pushsection __ex_table,"a"
+.balign 8
+.dword 0b, copy_exit
+.dword 1b, copy_exit
+.popsection
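For context, the handler-side lookup typically looks like the sketch below
(hypothetical code: the kvx trap handler is not part of this patch, and
the pt_regs/entry field names are illustrative):

	/* Hypothetical NOMAPPING fixup path in the trap handler. */
	static bool fixup_exception_sketch(struct pt_regs *regs)
	{
		const struct exception_table_entry *fixup;

		/* Match $spc against the trapping addresses recorded in
		 * the __ex_table section above. */
		fixup = search_exception_tables(regs->spc);
		if (!fixup)
			return false;

		/* Resume at the restore address (copy_exit), which returns
		 * the number of bytes left in $r0. */
		regs->spc = fixup->fixup;
		return true;
	}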
+/**
+ * Clear a user buffer
+ * r0 = buffer to clear
+ * r1 = size to clear
+ */
+.text
+SYM_FUNC_START(asm_clear_user)
+	/**
+	 * Naive byte-per-byte implementation
+	 */
+	make $r33 = 0x0;
+	make $r34 = 0x0;
+	/* If size == 0, exit directly */
+	cb.deqz $r1? clear_exit
+	;;
+	loopdo $r1, clear_exit
+		;;
+40:		sb $r33[$r0] = $r34
+		addd $r33 = $r33, 1 /* Ptr increment */
+		addd $r1 = $r1, -1 /* Remaining bytes to clear */
+		;;
+	clear_exit:
+	copyd $r0 = $r1
+	ret
+	;;
+SYM_FUNC_END(asm_clear_user)
+
+.pushsection __ex_table,"a"
+.balign 8
+.dword 40b, clear_exit
+.popsection

-- 
2.45.2