From nobody Thu Oct 2 22:39:59 2025
From: Eric Biggers
To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev
Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers
Subject: [PATCH 1/6] lib/crypto: sha256: Add support for 2-way interleaved hashing
Date: Tue, 9 Sep 2025 16:13:21 -0700
Message-ID: <20250909231326.1074672-2-ebiggers@kernel.org>
X-Mailer: git-send-email 2.50.1
In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org>
References: <20250909231326.1074672-1-ebiggers@kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Many arm64 and x86_64 CPUs can compute two SHA-256 hashes at nearly the same speed as one if the instructions are interleaved. This is because SHA-256 is serialized block-by-block, and two interleaved hashes take much better advantage of the CPU's instruction-level parallelism.

Meanwhile, a very common use case for SHA-256 hashing in the Linux kernel is dm-verity and fs-verity. Both use a Merkle tree with a fixed block size, usually 4096 bytes, with an empty or 32-byte salt prepended. Usually, many blocks need to be hashed at a time. This is an ideal scenario for 2-way interleaved hashing.
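As a rough illustration of that scenario (not part of this patch), a Merkle-tree style caller with an even number of equal-sized blocks could pair them up and hash each pair from a common salted context using the sha256_finup_2x() function introduced below. The helper name, block layout, and salt handling in this sketch are assumptions made purely for illustration; only the sha256_*() calls are from the library API.

    /*
     * Illustrative sketch only: hash an even number of equal-sized blocks
     * pairwise, starting from a context that has already absorbed the salt.
     * sha256_finup_2x() does not modify *ctx, so the same salted context
     * can be reused for every pair.
     */
    #include <crypto/sha2.h>

    static void hash_blocks_pairwise(const u8 *salt, size_t salt_len,
                                     const u8 *blocks, size_t block_size,
                                     size_t num_blocks, /* assumed even */
                                     u8 *digests) /* num_blocks * SHA256_DIGEST_SIZE bytes */
    {
            struct sha256_ctx salted_ctx;
            size_t i;

            /* Absorb the salt once; reuse the resulting context below. */
            sha256_init(&salted_ctx);
            sha256_update(&salted_ctx, salt, salt_len);

            for (i = 0; i < num_blocks; i += 2) {
                    /* Finish two independent messages from the common context. */
                    sha256_finup_2x(&salted_ctx,
                                    &blocks[i * block_size],
                                    &blocks[(i + 1) * block_size],
                                    block_size,
                                    &digests[i * SHA256_DIGEST_SIZE],
                                    &digests[(i + 1) * SHA256_DIGEST_SIZE]);
            }
    }

If the number of pending blocks is odd, the leftover block would simply be hashed with the ordinary sha256_update()/sha256_final() path.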
To enable this optimization, add a new function sha256_finup_2x() to the SHA-256 library API. It computes the hash of two equal-length messages, starting from a common initial context. For now it always falls back to sequential processing. Later patches will wire up arm64 and x86_64 optimized implementations. Note that the interleaving factor could in principle be higher than 2x. However, that runs into many practical difficulties and CPU throughput limitations. Thus, both the implementations I'm adding are 2x. In the interest of using the simplest solution, the API matches that. Signed-off-by: Eric Biggers --- include/crypto/sha2.h | 24 ++++++++++++++++ lib/crypto/sha256.c | 66 ++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 86 insertions(+), 4 deletions(-) diff --git a/include/crypto/sha2.h b/include/crypto/sha2.h index 15e461e568cca..5019c6f1eb3ff 100644 --- a/include/crypto/sha2.h +++ b/include/crypto/sha2.h @@ -373,10 +373,34 @@ void sha256_final(struct sha256_ctx *ctx, u8 out[SHA2= 56_DIGEST_SIZE]); * * Context: Any context. */ void sha256(const u8 *data, size_t len, u8 out[SHA256_DIGEST_SIZE]); =20 +/** + * sha256_finup_2x() - Compute two SHA-256 digests from a common initial + * context. On some CPUs, this is faster than sequentially + * computing each digest. + * @ctx: an optional initial context, which may have already processed dat= a. If + * NULL, a default initial context is used (equivalent to sha256_init()). + * @data1: data for the first message + * @data2: data for the second message + * @len: the length of each of @data1 and @data2, in bytes + * @out1: (output) the first SHA-256 message digest + * @out2: (output) the second SHA-256 message digest + */ +void sha256_finup_2x(const struct sha256_ctx *ctx, const u8 *data1, + const u8 *data2, size_t len, u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]); + +/** + * sha256_finup_2x_is_optimized() - Check if sha256_finup_2x() is using a = real + * interleaved implementation, as opposed to a + * sequential fallback + * @return: true if optimized + */ +bool sha256_finup_2x_is_optimized(void); + /** * struct hmac_sha256_key - Prepared key for HMAC-SHA256 * @key: private */ struct hmac_sha256_key { diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c index 8fa15165d23e8..f307b48e7a658 100644 --- a/lib/crypto/sha256.c +++ b/lib/crypto/sha256.c @@ -23,17 +23,25 @@ static const struct sha256_block_state sha224_iv =3D { SHA224_H0, SHA224_H1, SHA224_H2, SHA224_H3, SHA224_H4, SHA224_H5, SHA224_H6, SHA224_H7, }, }; =20 -static const struct sha256_block_state sha256_iv =3D { - .h =3D { - SHA256_H0, SHA256_H1, SHA256_H2, SHA256_H3, - SHA256_H4, SHA256_H5, SHA256_H6, SHA256_H7, +static const struct sha256_ctx initial_sha256_ctx =3D { + + .ctx =3D { + .state =3D { + .h =3D { + SHA256_H0, SHA256_H1, SHA256_H2, SHA256_H3, + SHA256_H4, SHA256_H5, SHA256_H6, SHA256_H7, + }, + }, + .bytecount =3D 0, }, }; =20 +#define sha256_iv (initial_sha256_ctx.ctx.state) + static const u32 sha256_K[64] =3D { 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da, @@ -259,10 +267,60 @@ void sha256(const u8 *data, size_t len, u8 out[SHA256= _DIGEST_SIZE]) sha256_update(&ctx, data, len); sha256_final(&ctx, out); } EXPORT_SYMBOL(sha256); =20 +#ifndef sha256_finup_2x_arch +static bool sha256_finup_2x_arch(const 
struct __sha256_ctx *ctx, + const u8 *data1, const u8 *data2, size_t len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]) +{ + return false; +} +static bool sha256_finup_2x_is_optimized_arch(void) +{ + return false; +} +#endif + +/* Sequential fallback implementation of sha256_finup_2x() */ +static noinline_for_stack void sha256_finup_2x_sequential( + const struct __sha256_ctx *ctx, const u8 *data1, const u8 *data2, + size_t len, u8 out1[SHA256_DIGEST_SIZE], u8 out2[SHA256_DIGEST_SIZE]) +{ + struct __sha256_ctx mut_ctx; + + mut_ctx =3D *ctx; + __sha256_update(&mut_ctx, data1, len); + __sha256_final(&mut_ctx, out1, SHA256_DIGEST_SIZE); + + mut_ctx =3D *ctx; + __sha256_update(&mut_ctx, data2, len); + __sha256_final(&mut_ctx, out2, SHA256_DIGEST_SIZE); +} + +void sha256_finup_2x(const struct sha256_ctx *ctx, const u8 *data1, + const u8 *data2, size_t len, u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]) +{ + if (ctx =3D=3D NULL) + ctx =3D &initial_sha256_ctx; + + if (likely(sha256_finup_2x_arch(&ctx->ctx, data1, data2, len, out1, + out2))) + return; + sha256_finup_2x_sequential(&ctx->ctx, data1, data2, len, out1, out2); +} +EXPORT_SYMBOL_GPL(sha256_finup_2x); + +bool sha256_finup_2x_is_optimized(void) +{ + return sha256_finup_2x_is_optimized_arch(); +} +EXPORT_SYMBOL_GPL(sha256_finup_2x_is_optimized); + /* pre-boot environment (as indicated by __DISABLE_EXPORTS) doesn't need H= MAC */ #ifndef __DISABLE_EXPORTS static void __hmac_sha256_preparekey(struct sha256_block_state *istate, struct sha256_block_state *ostate, const u8 *raw_key, size_t raw_key_len, --=20 2.50.1 From nobody Thu Oct 2 22:39:59 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BF4D286885; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459787; cv=none; b=TA9/5OQC2hzqgqyhcpVnYHOx7vLkrGaeekr4VAjX0YflazmGwJpSMS/SyzSs6KDFzbp2MzsK/Y7bkDxMcKoL30hpX6dIUFkgsuO1xnY1zDtvwgrxGbbxGbI/kkPyohQVvTMHlvn4b3Vvu7mDYcwUJkYkp6YkxjE7qoqi3dQg/gM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459787; c=relaxed/simple; bh=Zhg+v5tmcuxlFgfMaL5bgOzM8gmZljJLV/XULlh6+uo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=kbwIANrTdAhz/ypAvmu9pYdNopXW3EmBlTn5DjRfeX7E+mN38lkbx16ZW23W6IbowBwEgzpv16HzGxnasX6raSr4f3/2HEZCUY0Q39engo2IMxe9sB7lcimCKNfpvMbw38KJMUUIoaqS6ust53pivBcNhVglAXYxiBHSgDoy+5U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=s0hdHVZy; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="s0hdHVZy" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9A159C4CEFD; Tue, 9 Sep 2025 23:16:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757459786; bh=Zhg+v5tmcuxlFgfMaL5bgOzM8gmZljJLV/XULlh6+uo=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=s0hdHVZyzWWP8OkN+Nkj9Jst52w+40y2KIUuo+BT4hiZBNhrLWMKXC2j085IVs4Fo AflHIBu/z6wuLac9qVL6DQdfP33ooDGGQtho1DVB8hgZb3JGCpDb0yNnBHbElSbTqv Rg2IArCGpL94YsQt3Fkguy3DXq6qJ+MkrlRDg6ljB3m5TpiRzvmQj8U6LTuXRckJjq 
2B1GsEB2N0y3drc0iSqp9ZlEDD6gzyUOoEa/2eVyzWZpJajBI638h+nvunPBRJIezw Q/f/uS58HXyc8EDD4xR0JYw5yBTs/yW3uIRp2QearSYm+gF2xeOmer5GAI+KglMRPB OrNOI12DGPOPw== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers Subject: [PATCH 2/6] lib/crypto: arm64/sha256: Add support for 2-way interleaved hashing Date: Tue, 9 Sep 2025 16:13:22 -0700 Message-ID: <20250909231326.1074672-3-ebiggers@kernel.org> X-Mailer: git-send-email 2.50.1 In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org> References: <20250909231326.1074672-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add an implementation of sha256_finup_2x_arch() for arm64. It interleaves the computation of two SHA-256 hashes using the ARMv8 SHA-256 instructions. dm-verity and fs-verity will take advantage of this for greatly improved performance on capable CPUs. This increases the throughput of SHA-256 hashing 4096-byte messages by the following amounts on the following CPUs: ARM Cortex-X1: 70% ARM Cortex-X3: 68% ARM Cortex-A76: 65% ARM Cortex-A715: 43% ARM Cortex-A510: 25% ARM Cortex-A55: 8% Signed-off-by: Eric Biggers --- lib/crypto/arm64/sha256-ce.S | 281 ++++++++++++++++++++++++++++++++++- lib/crypto/arm64/sha256.h | 37 +++++ 2 files changed, 312 insertions(+), 6 deletions(-) diff --git a/lib/crypto/arm64/sha256-ce.S b/lib/crypto/arm64/sha256-ce.S index b99d9589c4217..ba0b32e3ad60e 100644 --- a/lib/crypto/arm64/sha256-ce.S +++ b/lib/crypto/arm64/sha256-ce.S @@ -68,22 +68,26 @@ .word 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5 .word 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3 .word 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208 .word 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 =20 + .macro load_round_constants tmp + adr_l \tmp, .Lsha2_rcon + ld1 { v0.4s- v3.4s}, [\tmp], #64 + ld1 { v4.4s- v7.4s}, [\tmp], #64 + ld1 { v8.4s-v11.4s}, [\tmp], #64 + ld1 {v12.4s-v15.4s}, [\tmp] + .endm + /* * size_t __sha256_ce_transform(struct sha256_block_state *state, * const u8 *data, size_t nblocks); */ .text SYM_FUNC_START(__sha256_ce_transform) - /* load round constants */ - adr_l x8, .Lsha2_rcon - ld1 { v0.4s- v3.4s}, [x8], #64 - ld1 { v4.4s- v7.4s}, [x8], #64 - ld1 { v8.4s-v11.4s}, [x8], #64 - ld1 {v12.4s-v15.4s}, [x8] + + load_round_constants x8 =20 /* load state */ ld1 {dgav.4s, dgbv.4s}, [x0] =20 /* load input */ @@ -132,5 +136,270 @@ CPU_LE( rev32 v19.16b, v19.16b ) /* store new state */ 1: st1 {dgav.4s, dgbv.4s}, [x0] mov x0, x2 ret SYM_FUNC_END(__sha256_ce_transform) + + .unreq dga + .unreq dgav + .unreq dgb + .unreq dgbv + .unreq t0 + .unreq t1 + .unreq dg0q + .unreq dg0v + .unreq dg1q + .unreq dg1v + .unreq dg2q + .unreq dg2v + + // parameters for sha256_ce_finup2x() + ctx .req x0 + data1 .req x1 + data2 .req x2 + len .req w3 + out1 .req x4 + out2 .req x5 + + // other scalar variables + count .req x6 + final_step .req w7 + + // x8-x9 are used as temporaries. + + // v0-v15 are used to cache the SHA-256 round constants. + // v16-v19 are used for the message schedule for the first message. + // v20-v23 are used for the message schedule for the second message. + // v24-v31 are used for the state and temporaries as given below. 
+ // *_a are for the first message and *_b for the second. + state0_a_q .req q24 + state0_a .req v24 + state1_a_q .req q25 + state1_a .req v25 + state0_b_q .req q26 + state0_b .req v26 + state1_b_q .req q27 + state1_b .req v27 + t0_a .req v28 + t0_b .req v29 + t1_a_q .req q30 + t1_a .req v30 + t1_b_q .req q31 + t1_b .req v31 + +#define OFFSETOF_BYTECOUNT 32 // offsetof(struct __sha256_ctx, bytecount) +#define OFFSETOF_BUF 40 // offsetof(struct __sha256_ctx, buf) +// offsetof(struct __sha256_ctx, state) is assumed to be 0. + + // Do 4 rounds of SHA-256 for each of two messages (interleaved). m0_a + // and m0_b contain the current 4 message schedule words for the first + // and second message respectively. + // + // If not all the message schedule words have been computed yet, then + // this also computes 4 more message schedule words for each message. + // m1_a-m3_a contain the next 3 groups of 4 message schedule words for + // the first message, and likewise m1_b-m3_b for the second. After + // consuming the current value of m0_a, this macro computes the group + // after m3_a and writes it to m0_a, and likewise for *_b. This means + // that the next (m0_a, m1_a, m2_a, m3_a) is the current (m1_a, m2_a, + // m3_a, m0_a), and likewise for *_b, so the caller must cycle through + // the registers accordingly. + .macro do_4rounds_2x i, k, m0_a, m1_a, m2_a, m3_a, \ + m0_b, m1_b, m2_b, m3_b + add t0_a\().4s, \m0_a\().4s, \k\().4s + add t0_b\().4s, \m0_b\().4s, \k\().4s + .if \i < 48 + sha256su0 \m0_a\().4s, \m1_a\().4s + sha256su0 \m0_b\().4s, \m1_b\().4s + sha256su1 \m0_a\().4s, \m2_a\().4s, \m3_a\().4s + sha256su1 \m0_b\().4s, \m2_b\().4s, \m3_b\().4s + .endif + mov t1_a.16b, state0_a.16b + mov t1_b.16b, state0_b.16b + sha256h state0_a_q, state1_a_q, t0_a\().4s + sha256h state0_b_q, state1_b_q, t0_b\().4s + sha256h2 state1_a_q, t1_a_q, t0_a\().4s + sha256h2 state1_b_q, t1_b_q, t0_b\().4s + .endm + + .macro do_16rounds_2x i, k0, k1, k2, k3 + do_4rounds_2x \i + 0, \k0, v16, v17, v18, v19, v20, v21, v22, v23 + do_4rounds_2x \i + 4, \k1, v17, v18, v19, v16, v21, v22, v23, v20 + do_4rounds_2x \i + 8, \k2, v18, v19, v16, v17, v22, v23, v20, v21 + do_4rounds_2x \i + 12, \k3, v19, v16, v17, v18, v23, v20, v21, v22 + .endm + +// +// void sha256_ce_finup2x(const struct __sha256_ctx *ctx, +// const u8 *data1, const u8 *data2, int len, +// u8 out1[SHA256_DIGEST_SIZE], +// u8 out2[SHA256_DIGEST_SIZE]); +// +// This function computes the SHA-256 digests of two messages |data1| and +// |data2| that are both |len| bytes long, starting from the initial conte= xt +// |ctx|. |len| must be at least SHA256_BLOCK_SIZE. +// +// The instructions for the two SHA-256 operations are interleaved. On ma= ny +// CPUs, this is almost twice as fast as hashing each message individually= due +// to taking better advantage of the CPU's SHA-256 and SIMD throughput. +// +SYM_FUNC_START(sha256_ce_finup2x) + sub sp, sp, #128 + mov final_step, #0 + load_round_constants x8 + + // Load the initial state from ctx->state. + ld1 {state0_a.4s-state1_a.4s}, [ctx] + + // Load ctx->bytecount. Take the mod 64 of it to get the number of + // bytes that are buffered in ctx->buf. Also save it in a register with + // len added to it. + ldr x8, [ctx, #OFFSETOF_BYTECOUNT] + add count, x8, len, sxtw + and x8, x8, #63 + cbz x8, .Lfinup2x_enter_loop // No bytes buffered? + + // x8 bytes (1 to 63) are currently buffered in ctx->buf. Load them + // followed by the first 64 - x8 bytes of data. 
Since len >=3D 64, we + // just load 64 bytes from each of ctx->buf, data1, and data2 + // unconditionally and rearrange the data as needed. + add x9, ctx, #OFFSETOF_BUF + ld1 {v16.16b-v19.16b}, [x9] + st1 {v16.16b-v19.16b}, [sp] + + ld1 {v16.16b-v19.16b}, [data1], #64 + add x9, sp, x8 + st1 {v16.16b-v19.16b}, [x9] + ld1 {v16.4s-v19.4s}, [sp] + + ld1 {v20.16b-v23.16b}, [data2], #64 + st1 {v20.16b-v23.16b}, [x9] + ld1 {v20.4s-v23.4s}, [sp] + + sub len, len, #64 + sub data1, data1, x8 + sub data2, data2, x8 + add len, len, w8 + mov state0_b.16b, state0_a.16b + mov state1_b.16b, state1_a.16b + b .Lfinup2x_loop_have_data + +.Lfinup2x_enter_loop: + sub len, len, #64 + mov state0_b.16b, state0_a.16b + mov state1_b.16b, state1_a.16b +.Lfinup2x_loop: + // Load the next two data blocks. + ld1 {v16.4s-v19.4s}, [data1], #64 + ld1 {v20.4s-v23.4s}, [data2], #64 +.Lfinup2x_loop_have_data: + // Convert the words of the data blocks from big endian. +CPU_LE( rev32 v16.16b, v16.16b ) +CPU_LE( rev32 v17.16b, v17.16b ) +CPU_LE( rev32 v18.16b, v18.16b ) +CPU_LE( rev32 v19.16b, v19.16b ) +CPU_LE( rev32 v20.16b, v20.16b ) +CPU_LE( rev32 v21.16b, v21.16b ) +CPU_LE( rev32 v22.16b, v22.16b ) +CPU_LE( rev32 v23.16b, v23.16b ) +.Lfinup2x_loop_have_bswapped_data: + + // Save the original state for each block. + st1 {state0_a.4s-state1_b.4s}, [sp] + + // Do the SHA-256 rounds on each block. + do_16rounds_2x 0, v0, v1, v2, v3 + do_16rounds_2x 16, v4, v5, v6, v7 + do_16rounds_2x 32, v8, v9, v10, v11 + do_16rounds_2x 48, v12, v13, v14, v15 + + // Add the original state for each block. + ld1 {v16.4s-v19.4s}, [sp] + add state0_a.4s, state0_a.4s, v16.4s + add state1_a.4s, state1_a.4s, v17.4s + add state0_b.4s, state0_b.4s, v18.4s + add state1_b.4s, state1_b.4s, v19.4s + + // Update len and loop back if more blocks remain. + sub len, len, #64 + tbz len, #31, .Lfinup2x_loop // len >=3D 0? + + // Check if any final blocks need to be handled. + // final_step =3D 2: all done + // final_step =3D 1: need to do count-only padding block + // final_step =3D 0: need to do the block with 0x80 padding byte + tbnz final_step, #1, .Lfinup2x_done + tbnz final_step, #0, .Lfinup2x_finalize_countonly + add len, len, #64 + cbz len, .Lfinup2x_finalize_blockaligned + + // Not block-aligned; 1 <=3D len <=3D 63 data bytes remain. Pad the bloc= k. + // To do this, write the padding starting with the 0x80 byte to + // &sp[64]. Then for each message, copy the last 64 data bytes to sp + // and load from &sp[64 - len] to get the needed padding block. This + // code relies on the data buffers being >=3D 64 bytes in length. + sub w8, len, #64 // w8 =3D len - 64 + add data1, data1, w8, sxtw // data1 +=3D len - 64 + add data2, data2, w8, sxtw // data2 +=3D len - 64 + mov x9, 0x80 + fmov d16, x9 + movi v17.16b, #0 + stp q16, q17, [sp, #64] + stp q17, q17, [sp, #96] + sub x9, sp, w8, sxtw // x9 =3D &sp[64 - len] + cmp len, #56 + b.ge 1f // will count spill into its own block? + lsl count, count, #3 + rev count, count + str count, [x9, #56] + mov final_step, #2 // won't need count-only block + b 2f +1: + mov final_step, #1 // will need count-only block +2: + ld1 {v16.16b-v19.16b}, [data1] + st1 {v16.16b-v19.16b}, [sp] + ld1 {v16.4s-v19.4s}, [x9] + ld1 {v20.16b-v23.16b}, [data2] + st1 {v20.16b-v23.16b}, [sp] + ld1 {v20.4s-v23.4s}, [x9] + b .Lfinup2x_loop_have_data + + // Prepare a padding block, either: + // + // {0x80, 0, 0, 0, ..., count (as __be64)} + // This is for a block aligned message. 
+ // + // { 0, 0, 0, 0, ..., count (as __be64)} + // This is for a message whose length mod 64 is >=3D 56. + // + // Pre-swap the endianness of the words. +.Lfinup2x_finalize_countonly: + movi v16.2d, #0 + b 1f +.Lfinup2x_finalize_blockaligned: + mov x8, #0x80000000 + fmov d16, x8 +1: + movi v17.2d, #0 + movi v18.2d, #0 + ror count, count, #29 // ror(lsl(count, 3), 32) + mov v19.d[0], xzr + mov v19.d[1], count + mov v20.16b, v16.16b + movi v21.2d, #0 + movi v22.2d, #0 + mov v23.16b, v19.16b + mov final_step, #2 + b .Lfinup2x_loop_have_bswapped_data + +.Lfinup2x_done: + // Write the two digests with all bytes in the correct order. +CPU_LE( rev32 state0_a.16b, state0_a.16b ) +CPU_LE( rev32 state1_a.16b, state1_a.16b ) +CPU_LE( rev32 state0_b.16b, state0_b.16b ) +CPU_LE( rev32 state1_b.16b, state1_b.16b ) + st1 {state0_a.4s-state1_a.4s}, [out1] + st1 {state0_b.4s-state1_b.4s}, [out2] + add sp, sp, #128 + ret +SYM_FUNC_END(sha256_ce_finup2x) diff --git a/lib/crypto/arm64/sha256.h b/lib/crypto/arm64/sha256.h index be4aeda9d0e6e..80d06df27d3a3 100644 --- a/lib/crypto/arm64/sha256.h +++ b/lib/crypto/arm64/sha256.h @@ -42,10 +42,47 @@ static void sha256_blocks(struct sha256_block_state *st= ate, } else { sha256_block_data_order(state, data, nblocks); } } =20 +static_assert(offsetof(struct __sha256_ctx, state) =3D=3D 0); +static_assert(offsetof(struct __sha256_ctx, bytecount) =3D=3D 32); +static_assert(offsetof(struct __sha256_ctx, buf) =3D=3D 40); +asmlinkage void sha256_ce_finup2x(const struct __sha256_ctx *ctx, + const u8 *data1, const u8 *data2, int len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]); + +#define sha256_finup_2x_arch sha256_finup_2x_arch +static bool sha256_finup_2x_arch(const struct __sha256_ctx *ctx, + const u8 *data1, const u8 *data2, size_t len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]) +{ + /* + * The assembly requires len >=3D SHA256_BLOCK_SIZE && len <=3D INT_MAX. + * Further limit len to 65536 to avoid spending too long with preemption + * disabled. (Of course, in practice len is nearly always 4096 anyway.) 
+ */ + if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && + static_branch_likely(&have_ce) && len >=3D SHA256_BLOCK_SIZE && + len <=3D 65536 && likely(may_use_simd())) { + kernel_neon_begin(); + sha256_ce_finup2x(ctx, data1, data2, len, out1, out2); + kernel_neon_end(); + kmsan_unpoison_memory(out1, SHA256_DIGEST_SIZE); + kmsan_unpoison_memory(out2, SHA256_DIGEST_SIZE); + return true; + } + return false; +} + +static bool sha256_finup_2x_is_optimized_arch(void) +{ + return static_key_enabled(&have_ce); +} + #ifdef CONFIG_KERNEL_MODE_NEON #define sha256_mod_init_arch sha256_mod_init_arch static void sha256_mod_init_arch(void) { if (cpu_have_named_feature(ASIMD)) { --=20 2.50.1 From nobody Thu Oct 2 22:39:59 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F1FA287255; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459787; cv=none; b=lWcYk2QJBW/Kn3CUXxP08MSNPt9f4qyeX+qK9XTgtMH5k2Yj5OLEAmFWk9mwbVr7GTF9/Ls/QIIb2tkTwKSSKT4a8f33oiPK1mTghcos7GbAcAcW8Lq7dUjM5yuYJH5DUABRVu6kGOlyfNQt/gfRrgZ9PAb+fgOcht7k/jwLiAQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459787; c=relaxed/simple; bh=S6D1yNE1Syyh365j61OFnP89O+T3sxGXw/t1oEMSH+M=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=LjBk+d8SrvSbsxNWHqDTjQ59p1f+7fGcTll8yahn9gNusqeAfrCjPqmh3ngIVAlE0NaMQ6PEEOMBgV9vWO/JRrs15KNCcNO2e94k2/YavAGrFBAhpgsaA4ard7YBIaRmwo5F3cVYr2j6iUMFfLq3yFqEr7YKZXZTdPZ7zdkwtwA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=PlKnfh9F; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="PlKnfh9F" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 16069C4CEFA; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757459787; bh=S6D1yNE1Syyh365j61OFnP89O+T3sxGXw/t1oEMSH+M=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=PlKnfh9F1ri+fH5JlMG64WG6iqYd9iiXy1jRm25czW9mthAY39+49oIyeZIu+bDNN wYmD+vmIvyDrXQwTQcBX1t1HRAvmHK+I9HLp/xLt7CeMqis2ZkHLf5Hw8CSmnvOvEi fDsc7HjJ87pXtHhx6ZOVdxa0RY48X4Y4bL2J+ppxfNHgHgQ0kpOF7+ZX6z2x4igQAB vUAIsJJZGsoruIetffDm9ERFA/WrcuUDf5dYy/8KRwLwrXTuJ9UjuCjWbFy/vpPGTj ROAOdBZyamLgS1wZ/xHVEdZzpXPIgAZ16uqK9P4CvafDhf/X5AanlSC+Fec7nF6DoW D4zUNhoxc10CA== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . 
Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers Subject: [PATCH 3/6] lib/crypto: x86/sha256: Add support for 2-way interleaved hashing Date: Tue, 9 Sep 2025 16:13:23 -0700 Message-ID: <20250909231326.1074672-4-ebiggers@kernel.org> X-Mailer: git-send-email 2.50.1 In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org> References: <20250909231326.1074672-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add an implementation of sha256_finup_2x_arch() for x86_64. It interleaves the computation of two SHA-256 hashes using the x86 SHA-256 instructions. dm-verity and fs-verity will take advantage of this for greatly improved performance on capable CPUs. This increases the throughput of SHA-256 hashing 4096-byte messages by the following amounts on the following CPUs: Intel Ice Lake (server): 4% Intel Sapphire Rapids: 38% Intel Emerald Rapids: 38% AMD Zen 1 (Threadripper 1950X): 84% AMD Zen 4 (EPYC 9B14): 98% AMD Zen 5 (Ryzen 9 9950X): 64% For now, this seems to benefit AMD more than Intel. This seems to be because current AMD CPUs support concurrent execution of the SHA-NI instructions, but unfortunately current Intel CPUs don't, except for the sha256msg2 instruction. Hopefully future Intel CPUs will support SHA-NI on more execution ports. Zen 1 supports 2 concurrent sha256rnds2, and Zen 4 supports 4 concurrent sha256rnds2, which suggests that even better performance may be achievable on Zen 4 by interleaving more than two hashes. However, doing so poses a number of trade-offs, and furthermore Zen 5 goes back to supporting "only" 2 concurrent sha256rnds2. Signed-off-by: Eric Biggers --- lib/crypto/x86/sha256-ni-asm.S | 368 +++++++++++++++++++++++++++++++++ lib/crypto/x86/sha256.h | 39 ++++ 2 files changed, 407 insertions(+) diff --git a/lib/crypto/x86/sha256-ni-asm.S b/lib/crypto/x86/sha256-ni-asm.S index 4bd9490ffc662..de5f707e7ef71 100644 --- a/lib/crypto/x86/sha256-ni-asm.S +++ b/lib/crypto/x86/sha256-ni-asm.S @@ -163,10 +163,378 @@ SYM_FUNC_START(sha256_ni_transform) movdqu STATE0, 1*16(STATE_PTR) =20 RET SYM_FUNC_END(sha256_ni_transform) =20 +#undef DIGEST_PTR +#undef DATA_PTR +#undef NUM_BLKS +#undef SHA256CONSTANTS +#undef MSG +#undef STATE0 +#undef STATE1 +#undef MSG0 +#undef MSG1 +#undef MSG2 +#undef MSG3 +#undef TMP +#undef SHUF_MASK +#undef ABEF_SAVE +#undef CDGH_SAVE + +// parameters for sha256_ni_finup2x() +#define CTX %rdi +#define DATA1 %rsi +#define DATA2 %rdx +#define LEN %ecx +#define LEN8 %cl +#define LEN64 %rcx +#define OUT1 %r8 +#define OUT2 %r9 + +// other scalar variables +#define SHA256CONSTANTS %rax +#define COUNT %r10 +#define COUNT32 %r10d +#define FINAL_STEP %r11d + +// rbx is used as a temporary. 
+ +#define MSG %xmm0 // sha256rnds2 implicit operand +#define STATE0_A %xmm1 +#define STATE1_A %xmm2 +#define STATE0_B %xmm3 +#define STATE1_B %xmm4 +#define TMP_A %xmm5 +#define TMP_B %xmm6 +#define MSG0_A %xmm7 +#define MSG1_A %xmm8 +#define MSG2_A %xmm9 +#define MSG3_A %xmm10 +#define MSG0_B %xmm11 +#define MSG1_B %xmm12 +#define MSG2_B %xmm13 +#define MSG3_B %xmm14 +#define SHUF_MASK %xmm15 + +#define OFFSETOF_STATE 0 // offsetof(struct __sha256_ctx, state) +#define OFFSETOF_BYTECOUNT 32 // offsetof(struct __sha256_ctx, bytecount) +#define OFFSETOF_BUF 40 // offsetof(struct __sha256_ctx, buf) + +// Do 4 rounds of SHA-256 for each of two messages (interleaved). m0_a an= d m0_b +// contain the current 4 message schedule words for the first and second m= essage +// respectively. +// +// If not all the message schedule words have been computed yet, then this= also +// computes 4 more message schedule words for each message. m1_a-m3_a con= tain +// the next 3 groups of 4 message schedule words for the first message, and +// likewise m1_b-m3_b for the second. After consuming the current value of +// m0_a, this macro computes the group after m3_a and writes it to m0_a, a= nd +// likewise for *_b. This means that the next (m0_a, m1_a, m2_a, m3_a) is= the +// current (m1_a, m2_a, m3_a, m0_a), and likewise for *_b, so the caller m= ust +// cycle through the registers accordingly. +.macro do_4rounds_2x i, m0_a, m1_a, m2_a, m3_a, m0_b, m1_b, m2_b, m3_b + movdqa (\i-32)*4(SHA256CONSTANTS), TMP_A + movdqa TMP_A, TMP_B + paddd \m0_a, TMP_A + paddd \m0_b, TMP_B +.if \i < 48 + sha256msg1 \m1_a, \m0_a + sha256msg1 \m1_b, \m0_b +.endif + movdqa TMP_A, MSG + sha256rnds2 STATE0_A, STATE1_A + movdqa TMP_B, MSG + sha256rnds2 STATE0_B, STATE1_B + pshufd $0x0E, TMP_A, MSG + sha256rnds2 STATE1_A, STATE0_A + pshufd $0x0E, TMP_B, MSG + sha256rnds2 STATE1_B, STATE0_B +.if \i < 48 + movdqa \m3_a, TMP_A + movdqa \m3_b, TMP_B + palignr $4, \m2_a, TMP_A + palignr $4, \m2_b, TMP_B + paddd TMP_A, \m0_a + paddd TMP_B, \m0_b + sha256msg2 \m3_a, \m0_a + sha256msg2 \m3_b, \m0_b +.endif +.endm + +// +// void sha256_ni_finup2x(const struct __sha256_ctx *ctx, +// const u8 *data1, const u8 *data2, int len, +// u8 out1[SHA256_DIGEST_SIZE], +// u8 out2[SHA256_DIGEST_SIZE]); +// +// This function computes the SHA-256 digests of two messages |data1| and +// |data2| that are both |len| bytes long, starting from the initial conte= xt +// |ctx|. |len| must be at least SHA256_BLOCK_SIZE. +// +// The instructions for the two SHA-256 operations are interleaved. On ma= ny +// CPUs, this is almost twice as fast as hashing each message individually= due +// to taking better advantage of the CPU's SHA-256 and SIMD throughput. +// +SYM_FUNC_START(sha256_ni_finup2x) + // Allocate 128 bytes of stack space, 16-byte aligned. + push %rbx + push %rbp + mov %rsp, %rbp + sub $128, %rsp + and $~15, %rsp + + // Load the shuffle mask for swapping the endianness of 32-bit words. + movdqa PSHUFFLE_BYTE_FLIP_MASK(%rip), SHUF_MASK + + // Set up pointer to the round constants. + lea K256+32*4(%rip), SHA256CONSTANTS + + // Initially we're not processing the final blocks. + xor FINAL_STEP, FINAL_STEP + + // Load the initial state from ctx->state. + movdqu OFFSETOF_STATE+0*16(CTX), STATE0_A // DCBA + movdqu OFFSETOF_STATE+1*16(CTX), STATE1_A // HGFE + movdqa STATE0_A, TMP_A + punpcklqdq STATE1_A, STATE0_A // FEBA + punpckhqdq TMP_A, STATE1_A // DCHG + pshufd $0x1B, STATE0_A, STATE0_A // ABEF + pshufd $0xB1, STATE1_A, STATE1_A // CDGH + + // Load ctx->bytecount. 
Take the mod 64 of it to get the number of + // bytes that are buffered in ctx->buf. Also save it in a register with + // LEN added to it. + mov LEN, LEN + mov OFFSETOF_BYTECOUNT(CTX), %rbx + lea (%rbx, LEN64, 1), COUNT + and $63, %ebx + jz .Lfinup2x_enter_loop // No bytes buffered? + + // %ebx bytes (1 to 63) are currently buffered in ctx->buf. Load them + // followed by the first 64 - %ebx bytes of data. Since LEN >=3D 64, we + // just load 64 bytes from each of ctx->buf, DATA1, and DATA2 + // unconditionally and rearrange the data as needed. + + movdqu OFFSETOF_BUF+0*16(CTX), MSG0_A + movdqu OFFSETOF_BUF+1*16(CTX), MSG1_A + movdqu OFFSETOF_BUF+2*16(CTX), MSG2_A + movdqu OFFSETOF_BUF+3*16(CTX), MSG3_A + movdqa MSG0_A, 0*16(%rsp) + movdqa MSG1_A, 1*16(%rsp) + movdqa MSG2_A, 2*16(%rsp) + movdqa MSG3_A, 3*16(%rsp) + + movdqu 0*16(DATA1), MSG0_A + movdqu 1*16(DATA1), MSG1_A + movdqu 2*16(DATA1), MSG2_A + movdqu 3*16(DATA1), MSG3_A + movdqu MSG0_A, 0*16(%rsp,%rbx) + movdqu MSG1_A, 1*16(%rsp,%rbx) + movdqu MSG2_A, 2*16(%rsp,%rbx) + movdqu MSG3_A, 3*16(%rsp,%rbx) + movdqa 0*16(%rsp), MSG0_A + movdqa 1*16(%rsp), MSG1_A + movdqa 2*16(%rsp), MSG2_A + movdqa 3*16(%rsp), MSG3_A + + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA2), MSG3_B + movdqu MSG0_B, 0*16(%rsp,%rbx) + movdqu MSG1_B, 1*16(%rsp,%rbx) + movdqu MSG2_B, 2*16(%rsp,%rbx) + movdqu MSG3_B, 3*16(%rsp,%rbx) + movdqa 0*16(%rsp), MSG0_B + movdqa 1*16(%rsp), MSG1_B + movdqa 2*16(%rsp), MSG2_B + movdqa 3*16(%rsp), MSG3_B + + sub $64, %rbx // rbx =3D buffered - 64 + sub %rbx, DATA1 // DATA1 +=3D 64 - buffered + sub %rbx, DATA2 // DATA2 +=3D 64 - buffered + add %ebx, LEN // LEN +=3D buffered - 64 + movdqa STATE0_A, STATE0_B + movdqa STATE1_A, STATE1_B + jmp .Lfinup2x_loop_have_data + +.Lfinup2x_enter_loop: + sub $64, LEN + movdqa STATE0_A, STATE0_B + movdqa STATE1_A, STATE1_B +.Lfinup2x_loop: + // Load the next two data blocks. + movdqu 0*16(DATA1), MSG0_A + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA1), MSG1_A + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA1), MSG2_A + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA1), MSG3_A + movdqu 3*16(DATA2), MSG3_B + add $64, DATA1 + add $64, DATA2 +.Lfinup2x_loop_have_data: + // Convert the words of the data blocks from big endian. + pshufb SHUF_MASK, MSG0_A + pshufb SHUF_MASK, MSG0_B + pshufb SHUF_MASK, MSG1_A + pshufb SHUF_MASK, MSG1_B + pshufb SHUF_MASK, MSG2_A + pshufb SHUF_MASK, MSG2_B + pshufb SHUF_MASK, MSG3_A + pshufb SHUF_MASK, MSG3_B +.Lfinup2x_loop_have_bswapped_data: + + // Save the original state for each block. + movdqa STATE0_A, 0*16(%rsp) + movdqa STATE0_B, 1*16(%rsp) + movdqa STATE1_A, 2*16(%rsp) + movdqa STATE1_B, 3*16(%rsp) + + // Do the SHA-256 rounds on each block. +.irp i, 0, 16, 32, 48 + do_4rounds_2x (\i + 0), MSG0_A, MSG1_A, MSG2_A, MSG3_A, \ + MSG0_B, MSG1_B, MSG2_B, MSG3_B + do_4rounds_2x (\i + 4), MSG1_A, MSG2_A, MSG3_A, MSG0_A, \ + MSG1_B, MSG2_B, MSG3_B, MSG0_B + do_4rounds_2x (\i + 8), MSG2_A, MSG3_A, MSG0_A, MSG1_A, \ + MSG2_B, MSG3_B, MSG0_B, MSG1_B + do_4rounds_2x (\i + 12), MSG3_A, MSG0_A, MSG1_A, MSG2_A, \ + MSG3_B, MSG0_B, MSG1_B, MSG2_B +.endr + + // Add the original state for each block. + paddd 0*16(%rsp), STATE0_A + paddd 1*16(%rsp), STATE0_B + paddd 2*16(%rsp), STATE1_A + paddd 3*16(%rsp), STATE1_B + + // Update LEN and loop back if more blocks remain. + sub $64, LEN + jge .Lfinup2x_loop + + // Check if any final blocks need to be handled. 
+ // FINAL_STEP =3D 2: all done + // FINAL_STEP =3D 1: need to do count-only padding block + // FINAL_STEP =3D 0: need to do the block with 0x80 padding byte + cmp $1, FINAL_STEP + jg .Lfinup2x_done + je .Lfinup2x_finalize_countonly + add $64, LEN + jz .Lfinup2x_finalize_blockaligned + + // Not block-aligned; 1 <=3D LEN <=3D 63 data bytes remain. Pad the bloc= k. + // To do this, write the padding starting with the 0x80 byte to + // &sp[64]. Then for each message, copy the last 64 data bytes to sp + // and load from &sp[64 - LEN] to get the needed padding block. This + // code relies on the data buffers being >=3D 64 bytes in length. + mov $64, %ebx + sub LEN, %ebx // ebx =3D 64 - LEN + sub %rbx, DATA1 // DATA1 -=3D 64 - LEN + sub %rbx, DATA2 // DATA2 -=3D 64 - LEN + mov $0x80, FINAL_STEP // using FINAL_STEP as a temporary + movd FINAL_STEP, MSG0_A + pxor MSG1_A, MSG1_A + movdqa MSG0_A, 4*16(%rsp) + movdqa MSG1_A, 5*16(%rsp) + movdqa MSG1_A, 6*16(%rsp) + movdqa MSG1_A, 7*16(%rsp) + cmp $56, LEN + jge 1f // will COUNT spill into its own block? + shl $3, COUNT + bswap COUNT + mov COUNT, 56(%rsp,%rbx) + mov $2, FINAL_STEP // won't need count-only block + jmp 2f +1: + mov $1, FINAL_STEP // will need count-only block +2: + movdqu 0*16(DATA1), MSG0_A + movdqu 1*16(DATA1), MSG1_A + movdqu 2*16(DATA1), MSG2_A + movdqu 3*16(DATA1), MSG3_A + movdqa MSG0_A, 0*16(%rsp) + movdqa MSG1_A, 1*16(%rsp) + movdqa MSG2_A, 2*16(%rsp) + movdqa MSG3_A, 3*16(%rsp) + movdqu 0*16(%rsp,%rbx), MSG0_A + movdqu 1*16(%rsp,%rbx), MSG1_A + movdqu 2*16(%rsp,%rbx), MSG2_A + movdqu 3*16(%rsp,%rbx), MSG3_A + + movdqu 0*16(DATA2), MSG0_B + movdqu 1*16(DATA2), MSG1_B + movdqu 2*16(DATA2), MSG2_B + movdqu 3*16(DATA2), MSG3_B + movdqa MSG0_B, 0*16(%rsp) + movdqa MSG1_B, 1*16(%rsp) + movdqa MSG2_B, 2*16(%rsp) + movdqa MSG3_B, 3*16(%rsp) + movdqu 0*16(%rsp,%rbx), MSG0_B + movdqu 1*16(%rsp,%rbx), MSG1_B + movdqu 2*16(%rsp,%rbx), MSG2_B + movdqu 3*16(%rsp,%rbx), MSG3_B + jmp .Lfinup2x_loop_have_data + + // Prepare a padding block, either: + // + // {0x80, 0, 0, 0, ..., count (as __be64)} + // This is for a block aligned message. + // + // { 0, 0, 0, 0, ..., count (as __be64)} + // This is for a message whose length mod 64 is >=3D 56. + // + // Pre-swap the endianness of the words. +.Lfinup2x_finalize_countonly: + pxor MSG0_A, MSG0_A + jmp 1f + +.Lfinup2x_finalize_blockaligned: + mov $0x80000000, %ebx + movd %ebx, MSG0_A +1: + pxor MSG1_A, MSG1_A + pxor MSG2_A, MSG2_A + ror $29, COUNT + movq COUNT, MSG3_A + pslldq $8, MSG3_A + movdqa MSG0_A, MSG0_B + pxor MSG1_B, MSG1_B + pxor MSG2_B, MSG2_B + movdqa MSG3_A, MSG3_B + mov $2, FINAL_STEP + jmp .Lfinup2x_loop_have_bswapped_data + +.Lfinup2x_done: + // Write the two digests with all bytes in the correct order. 
+ movdqa STATE0_A, TMP_A + movdqa STATE0_B, TMP_B + punpcklqdq STATE1_A, STATE0_A // GHEF + punpcklqdq STATE1_B, STATE0_B + punpckhqdq TMP_A, STATE1_A // ABCD + punpckhqdq TMP_B, STATE1_B + pshufd $0xB1, STATE0_A, STATE0_A // HGFE + pshufd $0xB1, STATE0_B, STATE0_B + pshufd $0x1B, STATE1_A, STATE1_A // DCBA + pshufd $0x1B, STATE1_B, STATE1_B + pshufb SHUF_MASK, STATE0_A + pshufb SHUF_MASK, STATE0_B + pshufb SHUF_MASK, STATE1_A + pshufb SHUF_MASK, STATE1_B + movdqu STATE0_A, 1*16(OUT1) + movdqu STATE0_B, 1*16(OUT2) + movdqu STATE1_A, 0*16(OUT1) + movdqu STATE1_B, 0*16(OUT2) + + mov %rbp, %rsp + pop %rbp + pop %rbx + RET +SYM_FUNC_END(sha256_ni_finup2x) + .section .rodata.cst256.K256, "aM", @progbits, 256 .align 64 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 diff --git a/lib/crypto/x86/sha256.h b/lib/crypto/x86/sha256.h index 41fa95fbc3bf8..38e33b22a0927 100644 --- a/lib/crypto/x86/sha256.h +++ b/lib/crypto/x86/sha256.h @@ -5,10 +5,12 @@ * Copyright 2025 Google LLC */ #include #include =20 +static __ro_after_init DEFINE_STATIC_KEY_FALSE(have_sha_ni); + DEFINE_STATIC_CALL(sha256_blocks_x86, sha256_blocks_generic); =20 #define DEFINE_X86_SHA256_FN(c_fn, asm_fn) = \ asmlinkage void asm_fn(struct sha256_block_state *state, \ const u8 *data, size_t nblocks); \ @@ -33,15 +35,52 @@ static void sha256_blocks(struct sha256_block_state *st= ate, const u8 *data, size_t nblocks) { static_call(sha256_blocks_x86)(state, data, nblocks); } =20 +static_assert(offsetof(struct __sha256_ctx, state) =3D=3D 0); +static_assert(offsetof(struct __sha256_ctx, bytecount) =3D=3D 32); +static_assert(offsetof(struct __sha256_ctx, buf) =3D=3D 40); +asmlinkage void sha256_ni_finup2x(const struct __sha256_ctx *ctx, + const u8 *data1, const u8 *data2, int len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]); + +#define sha256_finup_2x_arch sha256_finup_2x_arch +static bool sha256_finup_2x_arch(const struct __sha256_ctx *ctx, + const u8 *data1, const u8 *data2, size_t len, + u8 out1[SHA256_DIGEST_SIZE], + u8 out2[SHA256_DIGEST_SIZE]) +{ + /* + * The assembly requires len >=3D SHA256_BLOCK_SIZE && len <=3D INT_MAX. + * Further limit len to 65536 to avoid spending too long with preemption + * disabled. (Of course, in practice len is nearly always 4096 anyway.) 
+ */ + if (static_branch_likely(&have_sha_ni) && len >=3D SHA256_BLOCK_SIZE && + len <=3D 65536 && likely(irq_fpu_usable())) { + kernel_fpu_begin(); + sha256_ni_finup2x(ctx, data1, data2, len, out1, out2); + kernel_fpu_end(); + kmsan_unpoison_memory(out1, SHA256_DIGEST_SIZE); + kmsan_unpoison_memory(out2, SHA256_DIGEST_SIZE); + return true; + } + return false; +} + +static bool sha256_finup_2x_is_optimized_arch(void) +{ + return static_key_enabled(&have_sha_ni); +} + #define sha256_mod_init_arch sha256_mod_init_arch static void sha256_mod_init_arch(void) { if (boot_cpu_has(X86_FEATURE_SHA_NI)) { static_call_update(sha256_blocks_x86, sha256_blocks_ni); + static_branch_enable(&have_sha_ni); } else if (cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL) && boot_cpu_has(X86_FEATURE_AVX)) { if (boot_cpu_has(X86_FEATURE_AVX2) && boot_cpu_has(X86_FEATURE_BMI2)) --=20 2.50.1 From nobody Thu Oct 2 22:39:59 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3A4EE2882C8; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459788; cv=none; b=Ibffnztoza3q1n6p2dCekc5iDIxuxMWq0RgNLKE/bc5Qs0IcDk4wnp3TXhtEGx8UuumzQwJNTVxwMWFBhAb8ZHTfb3Tu86ZO/qgdSqmjazQ9HOZ1KB0TE5b/bBZEz6h5otu64hg8LtH1Jn9OpjPKMZdvwsaaU9DtHunP75rP2s4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459788; c=relaxed/simple; bh=+jtPvEG9mqAA+HOHIx5a/x8gc0AqK7W2yRGI6eP50+8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=OLfzGq9xFAi5P5NjACoo7JkrRm8EQ75FdUWFLbfTc1LoHas290a+0gfWZdxDjLqiiZH8kV1hB+Ww1YCHpuIgucyDMQna+2iARGi72uUTnv4lIjIWyQnw1oBiw/b1Khi81onbSko8Q7PCi5DZxK36i9NF1PzwOvdCrcLGgnLHqwc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=qdEk+uaF; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="qdEk+uaF" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 854C4C4CEF8; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757459787; bh=+jtPvEG9mqAA+HOHIx5a/x8gc0AqK7W2yRGI6eP50+8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=qdEk+uaFOdHaGiDPsTxBY7RpVY48whywiERdxCFE/N3bJzKqWolxFwS/BUhsFR5pX F5XDQszx6JgNe6yVCY4dV/pvdRyPuRuef/Si9wyxfP4QP+PFyGpJnYv1QKLQNApHBk qpu2cvprtfVcFxCwCXnorgXuLEvLDIwrUmpB986Kd8aGAwB5mnggFaoUodlhFz/62u ROmn62pPL2w3FFIRLhvoz40HkvqmYMPMi18RaXe+AWHJs3jnvvFIvp+Gxa+KhATbtr nOtxEtqGbawD7TGzOtNmyW1C/sVgYUci42C+1VEuRS2PBK1bfj4uiwWCyvobz9iaVi QqJKplc9Bwk/g== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . 
Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers Subject: [PATCH 4/6] lib/crypto: tests: Add tests and benchmark for sha256_finup_2x() Date: Tue, 9 Sep 2025 16:13:24 -0700 Message-ID: <20250909231326.1074672-5-ebiggers@kernel.org> X-Mailer: git-send-email 2.50.1 In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org> References: <20250909231326.1074672-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update sha256_kunit to include test cases and a benchmark for the new sha256_finup_2x() function. Signed-off-by: Eric Biggers --- lib/crypto/tests/sha256_kunit.c | 184 ++++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) diff --git a/lib/crypto/tests/sha256_kunit.c b/lib/crypto/tests/sha256_kuni= t.c index 1cd4caee6010d..628981f0a7eff 100644 --- a/lib/crypto/tests/sha256_kunit.c +++ b/lib/crypto/tests/sha256_kunit.c @@ -3,10 +3,11 @@ * Copyright 2025 Google LLC */ #include #include "sha256-testvecs.h" =20 +/* Generate the HASH_KUNIT_CASES using hash-test-template.h. */ #define HASH sha256 #define HASH_CTX sha256_ctx #define HASH_SIZE SHA256_DIGEST_SIZE #define HASH_INIT sha256_init #define HASH_UPDATE sha256_update @@ -19,13 +20,196 @@ #define HMAC_FINAL hmac_sha256_final #define HMAC hmac_sha256 #define HMAC_USINGRAWKEY hmac_sha256_usingrawkey #include "hash-test-template.h" =20 +static void free_guarded_buf(void *buf) +{ + vfree(buf); +} + +/* + * Allocate a KUnit-managed buffer that has length @len bytes immediately + * followed by an unmapped page. + */ +static void *alloc_guarded_buf(struct kunit *test, size_t len) +{ + size_t full_len =3D round_up(len, PAGE_SIZE); + void *buf =3D vmalloc(full_len); + + KUNIT_ASSERT_NOT_NULL(test, buf); + if (kunit_add_action_or_reset(test, free_guarded_buf, buf) !=3D 0) + return NULL; + return buf + full_len - len; +} + +/* + * Test for sha256_finup_2x(). Specifically, choose various data lengths = and + * salt lengths, and for each one, verify that sha256_finup_2x() produces = the + * same results as sha256_update() and sha256_final(). + * + * Use guarded buffers for all inputs and outputs to reliably detect any + * out-of-bounds reads or writes, even if they occur in assembly code. 
+ */ +static void test_sha256_finup_2x(struct kunit *test) +{ + const size_t max_data_len =3D 16384; + u8 *data1_buf, *data2_buf, *hash1, *hash2; + u8 expected_hash1[SHA256_DIGEST_SIZE]; + u8 expected_hash2[SHA256_DIGEST_SIZE]; + u8 salt[SHA256_BLOCK_SIZE]; + struct sha256_ctx *ctx; + + data1_buf =3D alloc_guarded_buf(test, max_data_len); + data2_buf =3D alloc_guarded_buf(test, max_data_len); + hash1 =3D alloc_guarded_buf(test, SHA256_DIGEST_SIZE); + hash2 =3D alloc_guarded_buf(test, SHA256_DIGEST_SIZE); + ctx =3D alloc_guarded_buf(test, sizeof(*ctx)); + + rand_bytes(data1_buf, max_data_len); + rand_bytes(data2_buf, max_data_len); + rand_bytes(salt, sizeof(salt)); + + for (size_t i =3D 0; i < 500; i++) { + size_t salt_len =3D rand_length(sizeof(salt)); + size_t data_len =3D rand_length(max_data_len); + const u8 *data1 =3D data1_buf + max_data_len - data_len; + const u8 *data2 =3D data2_buf + max_data_len - data_len; + struct sha256_ctx orig_ctx; + + sha256_init(ctx); + sha256_update(ctx, salt, salt_len); + orig_ctx =3D *ctx; + + sha256_finup_2x(ctx, data1, data2, data_len, hash1, hash2); + KUNIT_ASSERT_MEMEQ_MSG( + test, ctx, &orig_ctx, sizeof(*ctx), + "sha256_finup_2x() modified its ctx argument"); + + sha256_update(ctx, data1, data_len); + sha256_final(ctx, expected_hash1); + sha256_update(&orig_ctx, data2, data_len); + sha256_final(&orig_ctx, expected_hash2); + KUNIT_ASSERT_MEMEQ_MSG( + test, hash1, expected_hash1, SHA256_DIGEST_SIZE, + "Wrong hash1 with salt_len=3D%zu data_len=3D%zu", salt_len, + data_len); + KUNIT_ASSERT_MEMEQ_MSG( + test, hash2, expected_hash2, SHA256_DIGEST_SIZE, + "Wrong hash2 with salt_len=3D%zu data_len=3D%zu", salt_len, + data_len); + } +} + +/* Test sha256_finup_2x() with ctx =3D=3D NULL */ +static void test_sha256_finup_2x_defaultctx(struct kunit *test) +{ + const size_t data_len =3D 128; + struct sha256_ctx ctx; + u8 hash1_a[SHA256_DIGEST_SIZE]; + u8 hash2_a[SHA256_DIGEST_SIZE]; + u8 hash1_b[SHA256_DIGEST_SIZE]; + u8 hash2_b[SHA256_DIGEST_SIZE]; + + rand_bytes(test_buf, 2 * data_len); + + sha256_init(&ctx); + sha256_finup_2x(&ctx, test_buf, &test_buf[data_len], data_len, hash1_a, + hash2_a); + + sha256_finup_2x(NULL, test_buf, &test_buf[data_len], data_len, hash1_b, + hash2_b); + + KUNIT_ASSERT_MEMEQ(test, hash1_a, hash1_b, SHA256_DIGEST_SIZE); + KUNIT_ASSERT_MEMEQ(test, hash2_a, hash2_b, SHA256_DIGEST_SIZE); +} + +/* + * Test that sha256_finup_2x() and sha256_update/final() produce consistent + * results with total message lengths that require more than 32 bits. + */ +static void test_sha256_finup_2x_hugelen(struct kunit *test) +{ + const size_t data_len =3D 4 * SHA256_BLOCK_SIZE; + struct sha256_ctx ctx =3D {}; + u8 expected_hash[SHA256_DIGEST_SIZE]; + u8 hash[SHA256_DIGEST_SIZE]; + + rand_bytes(test_buf, data_len); + for (size_t align =3D 0; align < SHA256_BLOCK_SIZE; align++) { + sha256_init(&ctx); + ctx.ctx.bytecount =3D 0x123456789abcd00 + align; + + sha256_finup_2x(&ctx, test_buf, test_buf, data_len, hash, hash); + + sha256_update(&ctx, test_buf, data_len); + sha256_final(&ctx, expected_hash); + + KUNIT_ASSERT_MEMEQ(test, hash, expected_hash, + SHA256_DIGEST_SIZE); + } +} + +/* Benchmark for sha256_finup_2x() */ +static void benchmark_sha256_finup_2x(struct kunit *test) +{ + /* + * Try a few different salt lengths, since sha256_finup_2x() performance + * may vary slightly for the same data_len depending on how many bytes + * were already processed in the initial context. 
+ */ + static const size_t salt_lens_to_test[] =3D { 0, 32, 64 }; + const size_t data_len =3D 4096; + const size_t num_iters =3D 4096; + struct sha256_ctx ctx; + u8 hash1[SHA256_DIGEST_SIZE]; + u8 hash2[SHA256_DIGEST_SIZE]; + u64 t; + + if (!IS_ENABLED(CONFIG_CRYPTO_LIB_BENCHMARK)) + kunit_skip(test, "not enabled"); + if (!sha256_finup_2x_is_optimized()) + kunit_skip(test, "not relevant"); + + rand_bytes(test_buf, data_len * 2); + + /* Warm-up */ + for (size_t i =3D 0; i < num_iters; i++) + sha256_finup_2x(NULL, &test_buf[0], &test_buf[data_len], + data_len, hash1, hash2); + + for (size_t i =3D 0; i < ARRAY_SIZE(salt_lens_to_test); i++) { + size_t salt_len =3D salt_lens_to_test[i]; + + /* + * Prepare the initial context. The time to process the salt is + * not measured; we're just interested in sha256_finup_2x(). + */ + sha256_init(&ctx); + sha256_update(&ctx, test_buf, salt_len); + + preempt_disable(); + t =3D ktime_get_ns(); + for (size_t j =3D 0; j < num_iters; j++) + sha256_finup_2x(&ctx, &test_buf[0], &test_buf[data_len], + data_len, hash1, hash2); + t =3D ktime_get_ns() - t; + preempt_enable(); + kunit_info(test, "data_len=3D%zu salt_len=3D%zu: %llu MB/s", + data_len, salt_len, + div64_u64((u64)data_len * 2 * num_iters * 1000, + t ?: 1)); + } +} + static struct kunit_case hash_test_cases[] =3D { HASH_KUNIT_CASES, + KUNIT_CASE(test_sha256_finup_2x), + KUNIT_CASE(test_sha256_finup_2x_defaultctx), + KUNIT_CASE(test_sha256_finup_2x_hugelen), KUNIT_CASE(benchmark_hash), + KUNIT_CASE(benchmark_sha256_finup_2x), {}, }; =20 static struct kunit_suite hash_test_suite =3D { .name =3D "sha256", --=20 2.50.1 From nobody Thu Oct 2 22:39:59 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F0C8289E06; Tue, 9 Sep 2025 23:16:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459788; cv=none; b=IM2CFok+Kcd+cOG7BcSNYUgM1aMOjXcgo6iNMhx5oauSvtO90sts5mvDZQ2mLTHxtEftvnLXVK0jpg4+e3qaHwj3Cgd3KDYhdsBWWYzQMkFe316zELc8Rh0xr6SAXjgEYlW54RhCaNAkNL3ZsQrpOmudByCr6DS3O8kQNzCyprU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459788; c=relaxed/simple; bh=Wo2E3+6r1nAKVC9DS3YgQKzibRVbwi36i7wXqg9oQbM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=cht7lGWMLBPvr7g9+Q26O9tw8kmDb1Z0u4ZmHAnMYwoScwmuw8FqUkk50tKNa5R9ByE8qo81zWFPmQvVkPe7qY/Q6nylHxfQbQpYiJ5cs7l9fYQJPZeZNY6lH5d0rftuaRnXVwXpUFxVUpm4uhuySkN6OUxoiHhHILyoeH79vSA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=GAibJo01; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="GAibJo01" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 003D6C4CEFD; Tue, 9 Sep 2025 23:16:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757459788; bh=Wo2E3+6r1nAKVC9DS3YgQKzibRVbwi36i7wXqg9oQbM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=GAibJo01K5N7qAV7wLnPWYLzVTHDHe09AHiWUTOMsZK25vF4ZL3/qFgumBazcxyed ji03Vg4Ba9iEkUNDPucu/ZesLovQ5G0qEs2HWCW48MNWlUpfdV8QfeRUM/d6+KAPeG 
tx3P6fLMLUsLUcq8InH8yiH+FPjPoNSMMEzE1exDJ3Esu4xd8xPEpQMRKmkzEFioZx 0r4F6GpaiewdtNfQ7CzuWx0af4DFTnOZeRsvDCrS/jxr0VNWIiJAafOKKPyaSr+csz 35c6SM4sAxRp9t0mvfO9Ps3yclq5QmRNN0KVXleeLvBaFmSCgXgEZTdOpn8Y32ceXV k9r26FiJoKU5Q== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers Subject: [PATCH 5/6] fsverity: Remove inode parameter from fsverity_hash_block() Date: Tue, 9 Sep 2025 16:13:25 -0700 Message-ID: <20250909231326.1074672-6-ebiggers@kernel.org> X-Mailer: git-send-email 2.50.1 In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org> References: <20250909231326.1074672-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Due to the conversion from crypto_shash to the library API, fsverity_hash_block() can no longer fail. Therefore, the inode parameter, which was used only to print an error message in the case of a failure, is no longer necessary. Remove it. Signed-off-by: Eric Biggers --- fs/verity/enable.c | 12 +++++------- fs/verity/fsverity_private.h | 2 +- fs/verity/hash_algs.c | 3 +-- fs/verity/verify.c | 4 ++-- 4 files changed, 9 insertions(+), 12 deletions(-) diff --git a/fs/verity/enable.c b/fs/verity/enable.c index 503268cf42962..f2f5b0471b6b2 100644 --- a/fs/verity/enable.c +++ b/fs/verity/enable.c @@ -17,12 +17,11 @@ struct block_buffer { bool is_root_hash; u8 *data; }; =20 /* Hash a block, writing the result to the next level's pending block buff= er. */ -static int hash_one_block(struct inode *inode, - const struct merkle_tree_params *params, +static int hash_one_block(const struct merkle_tree_params *params, struct block_buffer *cur) { struct block_buffer *next =3D cur + 1; =20 /* @@ -34,12 +33,11 @@ static int hash_one_block(struct inode *inode, return -EINVAL; =20 /* Zero-pad the block if it's shorter than the block size. */ memset(&cur->data[cur->filled], 0, params->block_size - cur->filled); =20 - fsverity_hash_block(params, inode, cur->data, - &next->data[next->filled]); + fsverity_hash_block(params, cur->data, &next->data[next->filled]); next->filled +=3D params->digest_size; cur->filled =3D 0; return 0; } =20 @@ -121,22 +119,22 @@ static int build_merkle_tree(struct file *filp, if (bytes_read !=3D buffers[-1].filled) { err =3D -EINVAL; fsverity_err(inode, "Short read of file data"); goto out; } - err =3D hash_one_block(inode, params, &buffers[-1]); + err =3D hash_one_block(params, &buffers[-1]); if (err) goto out; for (level =3D 0; level < num_levels; level++) { if (buffers[level].filled + params->digest_size <=3D params->block_size) { /* Next block at @level isn't full yet */ break; } /* Next block at @level is full */ =20 - err =3D hash_one_block(inode, params, &buffers[level]); + err =3D hash_one_block(params, &buffers[level]); if (err) goto out; err =3D write_merkle_tree_block(inode, buffers[level].data, level_offset[level], @@ -152,11 +150,11 @@ static int build_merkle_tree(struct file *filp, cond_resched(); } /* Finish all nonempty pending tree blocks. 
*/ for (level =3D 0; level < num_levels; level++) { if (buffers[level].filled !=3D 0) { - err =3D hash_one_block(inode, params, &buffers[level]); + err =3D hash_one_block(params, &buffers[level]); if (err) goto out; err =3D write_merkle_tree_block(inode, buffers[level].data, level_offset[level], diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h index 5fe854a5b9ad3..d0458877afea4 100644 --- a/fs/verity/fsverity_private.h +++ b/fs/verity/fsverity_private.h @@ -87,11 +87,11 @@ const struct fsverity_hash_alg *fsverity_get_hash_alg(c= onst struct inode *inode, unsigned int num); union fsverity_hash_ctx * fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg, const u8 *salt, size_t salt_size); void fsverity_hash_block(const struct merkle_tree_params *params, - const struct inode *inode, const void *data, u8 *out); + const void *data, u8 *out); void fsverity_hash_buffer(const struct fsverity_hash_alg *alg, const void *data, size_t size, u8 *out); void __init fsverity_check_hash_algs(void); =20 /* init.c */ diff --git a/fs/verity/hash_algs.c b/fs/verity/hash_algs.c index 9bb3c6344907e..de53e14c8aa78 100644 --- a/fs/verity/hash_algs.c +++ b/fs/verity/hash_algs.c @@ -92,19 +92,18 @@ fsverity_prepare_hash_state(const struct fsverity_hash_= alg *alg, } =20 /** * fsverity_hash_block() - hash a single data or hash block * @params: the Merkle tree's parameters - * @inode: inode for which the hashing is being done * @data: virtual address of a buffer containing the block to hash * @out: output digest, size 'params->digest_size' bytes * * Hash a single data or hash block. The hash is salted if a salt is spec= ified * in the Merkle tree parameters. */ void fsverity_hash_block(const struct merkle_tree_params *params, - const struct inode *inode, const void *data, u8 *out) + const void *data, u8 *out) { union fsverity_hash_ctx ctx; =20 if (!params->hashstate) { fsverity_hash_buffer(params->hash_alg, data, params->block_size, diff --git a/fs/verity/verify.c b/fs/verity/verify.c index a1f00c3fd3b27..d7d5f65700b03 100644 --- a/fs/verity/verify.c +++ b/fs/verity/verify.c @@ -200,11 +200,11 @@ verify_data_block(struct inode *inode, struct fsverit= y_info *vi, struct page *hpage =3D hblocks[level - 1].page; const void *haddr =3D hblocks[level - 1].addr; unsigned long hblock_idx =3D hblocks[level - 1].index; unsigned int hoffset =3D hblocks[level - 1].hoffset; =20 - fsverity_hash_block(params, inode, haddr, real_hash); + fsverity_hash_block(params, haddr, real_hash); if (memcmp(want_hash, real_hash, hsize) !=3D 0) goto corrupted; /* * Mark the hash block as verified. This must be atomic and * idempotent, as the same hash block might be verified by @@ -219,11 +219,11 @@ verify_data_block(struct inode *inode, struct fsverit= y_info *vi, kunmap_local(haddr); put_page(hpage); } =20 /* Finally, verify the data block. 
*/ - fsverity_hash_block(params, inode, data, real_hash); + fsverity_hash_block(params, data, real_hash); if (memcmp(want_hash, real_hash, hsize) !=3D 0) goto corrupted; return true; =20 corrupted: --=20 2.50.1 From nobody Thu Oct 2 22:39:59 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D95A828B501; Tue, 9 Sep 2025 23:16:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459789; cv=none; b=kmXdpdclavVD4DgRLAyUUqTle3mw0Q1r8p28YhCU6KncdoIZ9YHMTBo+iffAo0tSsWtNlpF0gmo+PLKZ3LxIK+pv5xhkzy3Jihed3HjeB2XAYGVKf6quNfPPj9ZKGhiKme5psYW8xU9y3qHI81lue1GZAGx2vG1Mvl/VJA3ZmW0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757459789; c=relaxed/simple; bh=yOte++q8p6jqC8tT4uRqUh3/T6hbvaXSszTlzpMDbXE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=omJMfYC9t4cUvZPoVYzJ1rwvvugfo4YbQU2t4qg6hHSDKrMLHKeMUqOF+viYc35LEkmh5SE5byhFnCTLzh+kFS704w2Ya+qzWm9TSBiAvUB6lLnbsCBlQxIYUnqddNplYYlKjsw4iu2asfOGEzZxlaUOxCiSHzkxDnoGynhShWU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=qbnG4BZm; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="qbnG4BZm" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 6FD76C4CEF8; Tue, 9 Sep 2025 23:16:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1757459788; bh=yOte++q8p6jqC8tT4uRqUh3/T6hbvaXSszTlzpMDbXE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=qbnG4BZmy4Gd9CqiZ0/4goByTkMc81NLzxpH5RO5d77rJlwvJtBIuqLpfMArscktQ GH/Z8hkZPB+AeDzS2LcPlG1BDczphjAcf2zw9x/iIPAburg54ImTeN+qekb9WAVGol 1iLndTVEDJ/QQwYmEw0VENiaMg4cHIwzGhT78vQ40LdFLw+VrwPPVCkeEoRFMdjHkM 3ymXEPLartqtb1FUiSGe8eW8xiPAxdPQA7pyYKRgKYDDx1PSKce2nYjtkrn3zqWNB/ bEBVCHg3Px6XdhTeQT1SIvBTZ69EXi8soBRYz0mHrETQphHrtU7mFB7z0Jcxwj1YDp MzS4tfChwsKdw== From: Eric Biggers To: linux-crypto@vger.kernel.org, fsverity@lists.linux.dev Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , "Jason A . Donenfeld" , x86@kernel.org, Sami Tolvanen , Mikulas Patocka , linux-arm-kernel@lists.infradead.org, Eric Biggers Subject: [PATCH 6/6] fsverity: Use 2-way interleaved SHA-256 hashing when supported Date: Tue, 9 Sep 2025 16:13:26 -0700 Message-ID: <20250909231326.1074672-7-ebiggers@kernel.org> X-Mailer: git-send-email 2.50.1 In-Reply-To: <20250909231326.1074672-1-ebiggers@kernel.org> References: <20250909231326.1074672-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When the crypto library provides an optimized implementation of sha256_finup_2x(), use it to interleave the hashing of pairs of data blocks. On some CPUs this nearly doubles hashing performance. The increase in overall throughput of cold-cache fsverity reads that I'm seeing on arm64 and x86_64 is roughly 35% (though this metric is hard to measure as it jumps around a lot). 
For now this is only done on the verification path, and only for data blocks, not Merkle tree blocks. We could use sha256_finup_2x() on Merkle tree blocks too, but that is less important as there aren't as many Merkle tree blocks as data blocks, and that would require some additional code restructuring. We could also use sha256_finup_2x() to accelerate building the Merkle tree, but verification performance is more important. Signed-off-by: Eric Biggers --- fs/verity/verify.c | 173 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 139 insertions(+), 34 deletions(-) diff --git a/fs/verity/verify.c b/fs/verity/verify.c index d7d5f65700b03..6dd03468e2a26 100644 --- a/fs/verity/verify.c +++ b/fs/verity/verify.c @@ -8,10 +8,35 @@ #include "fsverity_private.h" =20 #include #include =20 +#define FS_VERITY_MAX_PENDING_BLOCKS 2 + +struct fsverity_pending_block { + const void *data; + u64 pos; + u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE]; +}; + +struct fsverity_verification_context { + struct inode *inode; + struct fsverity_info *vi; + unsigned long max_ra_pages; + + /* + * This is the queue of data blocks that are pending verification. When + * the crypto layer supports interleaved hashing, we allow multiple + * blocks to be queued up in order to utilize it. This can improve + * performance significantly vs. sequential hashing of each block. + */ + int num_pending; + int max_pending; + struct fsverity_pending_block + pending_blocks[FS_VERITY_MAX_PENDING_BLOCKS]; +}; + static struct workqueue_struct *fsverity_read_workqueue; =20 /* * Returns true if the hash block with index @hblock_idx in the tree, loca= ted in * @hpage, has already been verified. @@ -77,23 +102,24 @@ static bool is_hash_block_verified(struct fsverity_inf= o *vi, struct page *hpage, SetPageChecked(hpage); return false; } =20 /* - * Verify a single data block against the file's Merkle tree. + * Verify the hash of a single data block against the file's Merkle tree. * * In principle, we need to verify the entire path to the root node. Howe= ver, * for efficiency the filesystem may cache the hash blocks. Therefore we = need * only ascend the tree until an already-verified hash block is seen, and = then * verify the path to that block. * * Return: %true if the data block is valid, else %false. */ -static bool -verify_data_block(struct inode *inode, struct fsverity_info *vi, - const void *data, u64 data_pos, unsigned long max_ra_pages) +static bool verify_data_block(struct inode *inode, struct fsverity_info *v= i, + const struct fsverity_pending_block *dblock, + unsigned long max_ra_pages) { + const u64 data_pos =3D dblock->pos; const struct merkle_tree_params *params =3D &vi->tree_params; const unsigned int hsize =3D params->digest_size; int level; u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE]; const u8 *want_hash; @@ -113,23 +139,27 @@ verify_data_block(struct inode *inode, struct fsverit= y_info *vi, * The index of the previous level's block within that level; also the * index of that block's hash within the current level. */ u64 hidx =3D data_pos >> params->log_blocksize; =20 - /* Up to 1 + FS_VERITY_MAX_LEVELS pages may be mapped at once */ - BUILD_BUG_ON(1 + FS_VERITY_MAX_LEVELS > KM_MAX_IDX); + /* + * Up to FS_VERITY_MAX_PENDING_BLOCKS + FS_VERITY_MAX_LEVELS pages may + * be mapped at once. 
+ */ + static_assert(FS_VERITY_MAX_PENDING_BLOCKS + FS_VERITY_MAX_LEVELS <=3D + KM_MAX_IDX); =20 if (unlikely(data_pos >=3D inode->i_size)) { /* * This can happen in the data page spanning EOF when the Merkle * tree block size is less than the page size. The Merkle tree * doesn't cover data blocks fully past EOF. But the entire * page spanning EOF can be visible to userspace via a mmap, and * any part past EOF should be all zeroes. Therefore, we need * to verify that any data blocks fully past EOF are all zeroes. */ - if (memchr_inv(data, 0, params->block_size)) { + if (memchr_inv(dblock->data, 0, params->block_size)) { fsverity_err(inode, "FILE CORRUPTED! Data past EOF is not zeroed"); return false; } return true; @@ -218,53 +248,110 @@ verify_data_block(struct inode *inode, struct fsveri= ty_info *vi, want_hash =3D _want_hash; kunmap_local(haddr); put_page(hpage); } =20 - /* Finally, verify the data block. */ - fsverity_hash_block(params, data, real_hash); - if (memcmp(want_hash, real_hash, hsize) !=3D 0) + /* Finally, verify the hash of the data block. */ + if (memcmp(want_hash, dblock->real_hash, hsize) !=3D 0) goto corrupted; return true; =20 corrupted: - fsverity_err(inode, - "FILE CORRUPTED! pos=3D%llu, level=3D%d, want_hash=3D%s:%*phN, real= _hash=3D%s:%*phN", - data_pos, level - 1, - params->hash_alg->name, hsize, want_hash, - params->hash_alg->name, hsize, real_hash); + fsverity_err( + inode, + "FILE CORRUPTED! pos=3D%llu, level=3D%d, want_hash=3D%s:%*phN, real_hash= =3D%s:%*phN", + data_pos, level - 1, params->hash_alg->name, hsize, want_hash, + params->hash_alg->name, hsize, + level =3D=3D 0 ? dblock->real_hash : real_hash); error: for (; level > 0; level--) { kunmap_local(hblocks[level - 1].addr); put_page(hblocks[level - 1].page); } return false; } =20 -static bool -verify_data_blocks(struct folio *data_folio, size_t len, size_t offset, - unsigned long max_ra_pages) +static void +fsverity_init_verification_context(struct fsverity_verification_context *c= tx, + struct inode *inode, + unsigned long max_ra_pages) { - struct inode *inode =3D data_folio->mapping->host; struct fsverity_info *vi =3D inode->i_verity_info; - const unsigned int block_size =3D vi->tree_params.block_size; + + ctx->inode =3D inode; + ctx->vi =3D vi; + ctx->max_ra_pages =3D max_ra_pages; + ctx->num_pending =3D 0; + if (vi->tree_params.hash_alg->algo_id =3D=3D HASH_ALGO_SHA256 && + sha256_finup_2x_is_optimized()) + ctx->max_pending =3D 2; + else + ctx->max_pending =3D 1; +} + +static void +fsverity_clear_pending_blocks(struct fsverity_verification_context *ctx) +{ + int i; + + for (i =3D ctx->num_pending - 1; i >=3D 0; i--) { + kunmap_local(ctx->pending_blocks[i].data); + ctx->pending_blocks[i].data =3D NULL; + } + ctx->num_pending =3D 0; +} + +static bool +fsverity_verify_pending_blocks(struct fsverity_verification_context *ctx) +{ + struct fsverity_info *vi =3D ctx->vi; + const struct merkle_tree_params *params =3D &vi->tree_params; + int i; + + if (ctx->num_pending =3D=3D 2) { + /* Implies algo_id =3D=3D HASH_ALGO_SHA256 */ + sha256_finup_2x(params->hashstate ? 
&params->hashstate->sha256 : + NULL, + ctx->pending_blocks[0].data, + ctx->pending_blocks[1].data, params->block_size, + ctx->pending_blocks[0].real_hash, + ctx->pending_blocks[1].real_hash); + } else { + for (i =3D 0; i < ctx->num_pending; i++) + fsverity_hash_block(params, ctx->pending_blocks[i].data, + ctx->pending_blocks[i].real_hash); + } + + for (i =3D 0; i < ctx->num_pending; i++) { + if (!verify_data_block(ctx->inode, vi, &ctx->pending_blocks[i], + ctx->max_ra_pages)) + return false; + } + fsverity_clear_pending_blocks(ctx); + return true; +} + +static bool fsverity_add_data_blocks(struct fsverity_verification_context = *ctx, + struct folio *data_folio, size_t len, + size_t offset) +{ + struct fsverity_info *vi =3D ctx->vi; + const struct merkle_tree_params *params =3D &vi->tree_params; + const unsigned int block_size =3D params->block_size; u64 pos =3D (u64)data_folio->index << PAGE_SHIFT; =20 if (WARN_ON_ONCE(len <=3D 0 || !IS_ALIGNED(len | offset, block_size))) return false; if (WARN_ON_ONCE(!folio_test_locked(data_folio) || folio_test_uptodate(data_folio))) return false; do { - void *data; - bool valid; - - data =3D kmap_local_folio(data_folio, offset); - valid =3D verify_data_block(inode, vi, data, pos + offset, - max_ra_pages); - kunmap_local(data); - if (!valid) + ctx->pending_blocks[ctx->num_pending].data =3D + kmap_local_folio(data_folio, offset); + ctx->pending_blocks[ctx->num_pending].pos =3D pos + offset; + if (++ctx->num_pending =3D=3D ctx->max_pending && + !fsverity_verify_pending_blocks(ctx)) return false; offset +=3D block_size; len -=3D block_size; } while (len); return true; @@ -282,11 +369,19 @@ verify_data_blocks(struct folio *data_folio, size_t l= en, size_t offset, * * Return: %true if the data is valid, else %false. */ bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset) { - return verify_data_blocks(folio, len, offset, 0); + struct fsverity_verification_context ctx; + + fsverity_init_verification_context(&ctx, folio->mapping->host, 0); + + if (fsverity_add_data_blocks(&ctx, folio, len, offset) && + fsverity_verify_pending_blocks(&ctx)) + return true; + fsverity_clear_pending_blocks(&ctx); + return false; } EXPORT_SYMBOL_GPL(fsverity_verify_blocks); =20 #ifdef CONFIG_BLOCK /** @@ -303,10 +398,12 @@ EXPORT_SYMBOL_GPL(fsverity_verify_blocks); * filesystems) must instead call fsverity_verify_page() directly on each = page. * All filesystems must also call fsverity_verify_page() on holes. */ void fsverity_verify_bio(struct bio *bio) { + struct inode *inode =3D bio_first_folio_all(bio)->mapping->host; + struct fsverity_verification_context ctx; struct folio_iter fi; unsigned long max_ra_pages =3D 0; =20 if (bio->bi_opf & REQ_RAHEAD) { /* @@ -319,17 +416,25 @@ void fsverity_verify_bio(struct bio *bio) * reduces the number of I/O requests made to the Merkle tree. */ max_ra_pages =3D bio->bi_iter.bi_size >> (PAGE_SHIFT + 2); } =20 + fsverity_init_verification_context(&ctx, inode, max_ra_pages); + bio_for_each_folio_all(fi, bio) { - if (!verify_data_blocks(fi.folio, fi.length, fi.offset, - max_ra_pages)) { - bio->bi_status =3D BLK_STS_IOERR; - break; - } + if (!fsverity_add_data_blocks(&ctx, fi.folio, fi.length, + fi.offset)) + goto ioerr; } + + if (!fsverity_verify_pending_blocks(&ctx)) + goto ioerr; + return; + +ioerr: + fsverity_clear_pending_blocks(&ctx); + bio->bi_status =3D BLK_STS_IOERR; } EXPORT_SYMBOL_GPL(fsverity_verify_bio); #endif /* CONFIG_BLOCK */ =20 /** --=20 2.50.1
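
To make the caller-side pattern from patch 6/6 concrete, here is a minimal sketch (not part of this series) of batching fixed-size data blocks in pairs: hash each pair with sha256_finup_2x() when an interleaved implementation is available, and fall back to hashing one block at a time otherwise. Only sha256_finup_2x(), sha256_finup_2x_is_optimized(), and the struct sha256_ctx API from patch 1 are taken from the series; the helper name hash_blocks_2x(), the flat digest array, and the fixed 4096-byte block size are assumptions made for the example.

	#include <crypto/sha2.h>

	/*
	 * Illustrative helper (hypothetical, not from the series): hash
	 * @nblocks contiguous 4096-byte blocks at @data, writing one SHA-256
	 * digest per block to @digests.  @salt_ctx is an optional context that
	 * has already absorbed a salt, as fs-verity uses; NULL means the
	 * default initial state.
	 */
	static void hash_blocks_2x(const struct sha256_ctx *salt_ctx,
				   const u8 *data, size_t nblocks,
				   u8 (*digests)[SHA256_DIGEST_SIZE])
	{
		const size_t block_size = 4096;	/* assumed block size */
		size_t i = 0;

		/* Interleaved path: two equal-length blocks per call */
		if (sha256_finup_2x_is_optimized()) {
			for (; i + 2 <= nblocks; i += 2)
				sha256_finup_2x(salt_ctx,
						data + i * block_size,
						data + (i + 1) * block_size,
						block_size,
						digests[i], digests[i + 1]);
		}

		/* Sequential fallback, and any odd trailing block */
		for (; i < nblocks; i++) {
			struct sha256_ctx ctx;

			if (salt_ctx)
				ctx = *salt_ctx; /* resume from the salted state */
			else
				sha256_init(&ctx);
			sha256_update(&ctx, data + i * block_size, block_size);
			sha256_final(&ctx, digests[i]);
		}
	}

In the series itself this pairing is driven by the fsverity_pending_block queue rather than a flat array, so the verification context can carry a pending block across folios within the same bio before flushing the pair.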