Add support for verifying ML-DSA signatures.
ML-DSA (Module-Lattice-Based Digital Signature Algorithm) is specified
in FIPS 204 and is the standardized version of Dilithium. Unlike RSA and
elliptic-curve cryptography, ML-DSA is believed to be secure even
against adversaries in possession of a large-scale quantum computer.
Compared to the earlier patch
(https://lore.kernel.org/r/20251117145606.2155773-3-dhowells@redhat.com/)
that was based on "leancrypto", this implementation:
- Is about 600 lines of source code instead of 4800.
- Generates about 4 KB of object code instead of 28 KB.
- Uses 9-13 KB of memory to verify a signature instead of 31-84 KB.
- Is 3-5% faster, depending on the ML-DSA parameter set.
The API just consists of a single function mldsa_verify(), supporting
the standard parameter sets (ML-DSA-44, ML-DSA-65, and ML-DSA-87) as
selected by an enum. That's all that's actually needed.
HashML-DSA, incremental message hashing, and nonempty contexts aren't
supported, as they aren't needed yet. Likewise, only verification
support is included, since it's all the kernel needs. It's much simpler
than full keygen+sign+verify support, and it means that constant-time
code isn't needed either. (I've still used constant-time patterns in
some places, but technically they're not required, and some steps in
ML-DSA verification are inherently variable-time anyway.)
Note that mldsa_verify() allocates memory, so it can sleep and can fail
with ENOMEM. Unfortunately we don't have much choice about that, since
ML-DSA needs a lot of memory. At least callers have to check for errors
anyway, since the signature could be invalid.
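For illustration, a minimal caller might look like this (hypothetical
function; where the key, message, and signature buffers come from is up
to the caller):

	#include <crypto/mldsa.h>
	#include <linux/printk.h>

	/* Hypothetical caller sketch, not part of this patch. */
	static int example_check_mldsa65_sig(const u8 *pk, size_t pk_len,
					     const u8 *msg, size_t msg_len,
					     const u8 *sig, size_t sig_len)
	{
		int err = mldsa_verify(MLDSA65, sig, sig_len, msg, msg_len,
				       pk, pk_len);

		if (err == -EBADMSG)
			pr_warn("invalid ML-DSA signature\n");
		else if (err)
			pr_warn("ML-DSA verification error: %d\n", err); /* e.g. -ENOMEM */
		return err;
	}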
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
---
include/crypto/mldsa.h | 51 ++++
lib/crypto/Kconfig | 7 +
lib/crypto/Makefile | 5 +
lib/crypto/mldsa.c | 566 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 629 insertions(+)
create mode 100644 include/crypto/mldsa.h
create mode 100644 lib/crypto/mldsa.c
diff --git a/include/crypto/mldsa.h b/include/crypto/mldsa.h
new file mode 100644
index 000000000000..f0c212e9e4f1
--- /dev/null
+++ b/include/crypto/mldsa.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Support for verifying ML-DSA signatures
+ *
+ * Copyright 2025 Google LLC
+ */
+#ifndef _CRYPTO_MLDSA_H
+#define _CRYPTO_MLDSA_H
+
+#include <linux/types.h>
+
+/* Identifier for an ML-DSA parameter set */
+enum mldsa_alg {
+ MLDSA44, /* ML-DSA-44 */
+ MLDSA65, /* ML-DSA-65 */
+ MLDSA87, /* ML-DSA-87 */
+};
+
+/* Lengths of ML-DSA public keys and signatures in bytes */
+#define MLDSA44_PUBLIC_KEY_SIZE 1312
+#define MLDSA65_PUBLIC_KEY_SIZE 1952
+#define MLDSA87_PUBLIC_KEY_SIZE 2592
+#define MLDSA44_SIGNATURE_SIZE 2420
+#define MLDSA65_SIGNATURE_SIZE 3309
+#define MLDSA87_SIGNATURE_SIZE 4627
+
+/**
+ * mldsa_verify() - Verify an ML-DSA signature
+ * @alg: The ML-DSA parameter set to use
+ * @sig: The signature
+ * @sig_len: Length of the signature in bytes. Should match the
+ * MLDSA*_SIGNATURE_SIZE constant associated with @alg,
+ * otherwise -EBADMSG will be returned right away.
+ * @msg: The message
+ * @msg_len: Length of the message in bytes
+ * @pk: The public key
+ * @pk_len: Length of the public key in bytes. Should match the
+ * MLDSA*_PUBLIC_KEY_SIZE constant associated with @alg,
+ * otherwise -EBADMSG will be returned right away.
+ *
+ * This verifies an ML-DSA signature using the specified ML-DSA parameter set.
+ * The context string is assumed to be empty.
+ *
+ * Context: Might sleep
+ * Return: 0 if the signature is valid, -EBADMSG if the signature is invalid, or
+ * -ENOMEM if out of memory so the validity of the signature is unknown
+ */
+int mldsa_verify(enum mldsa_alg alg, const u8 *sig, size_t sig_len,
+ const u8 *msg, size_t msg_len, const u8 *pk, size_t pk_len);
+
+#endif /* _CRYPTO_MLDSA_H */
diff --git a/lib/crypto/Kconfig b/lib/crypto/Kconfig
index 9d04b3771ce2..51ac3186ebc2 100644
--- a/lib/crypto/Kconfig
+++ b/lib/crypto/Kconfig
@@ -98,10 +98,17 @@ config CRYPTO_LIB_MD5_ARCH
depends on CRYPTO_LIB_MD5 && !UML
default y if MIPS && CPU_CAVIUM_OCTEON
default y if PPC
default y if SPARC64
+config CRYPTO_LIB_MLDSA
+ tristate
+ select CRYPTO_LIB_SHA3
+ help
+ The ML-DSA library functions. Select this if your module uses any of
+ the functions from <crypto/mldsa.h>.
+
config CRYPTO_LIB_POLY1305
tristate
help
The Poly1305 library functions. Select this if your module uses any
of the functions from <crypto/poly1305.h>.
diff --git a/lib/crypto/Makefile b/lib/crypto/Makefile
index 6580991f8e12..fb83ec480ec0 100644
--- a/lib/crypto/Makefile
+++ b/lib/crypto/Makefile
@@ -125,10 +125,15 @@ libmd5-$(CONFIG_PPC) += powerpc/md5-asm.o
libmd5-$(CONFIG_SPARC) += sparc/md5_asm.o
endif # CONFIG_CRYPTO_LIB_MD5_ARCH
################################################################################
+obj-$(CONFIG_CRYPTO_LIB_MLDSA) += libmldsa.o
+libmldsa-y := mldsa.o
+
+################################################################################
+
obj-$(CONFIG_CRYPTO_LIB_POLY1305) += libpoly1305.o
libpoly1305-y := poly1305.o
ifeq ($(CONFIG_ARCH_SUPPORTS_INT128),y)
libpoly1305-$(CONFIG_CRYPTO_LIB_POLY1305_GENERIC) += poly1305-donna64.o
else
diff --git a/lib/crypto/mldsa.c b/lib/crypto/mldsa.c
new file mode 100644
index 000000000000..94ac4df9a15f
--- /dev/null
+++ b/lib/crypto/mldsa.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Support for verifying ML-DSA signatures
+ *
+ * Copyright 2025 Google LLC
+ */
+
+#include <crypto/mldsa.h>
+#include <crypto/sha3.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/unaligned.h>
+
+#define Q 8380417 /* The prime q = 2^23 - 2^13 + 1 */
+#define QINV_MOD_R 58728449 /* Multiplicative inverse of q mod 2^32 */
+#define R_MOD_Q 4193792 /* 2^32 mod q */
+#define N 256 /* Number of components per ring element */
+#define D 13 /* Number of dropped bits in t */
+#define RHO_LEN 32 /* Length of the public random seed in bytes */
+
+/*
+ * The zetas array in Montgomery form, i.e. with extra factor of 2^32.
+ * Reference: FIPS 204 Section 7.5 "NTT and NTT^-1"
+ * Generated by the following Python code:
+ * q=8380417; [a%q - q*(a%q > q//2) for a in [1753**(int(f'{i:08b}'[::-1], 2)) << 32 for i in range(256)]]
+ */
+static const s32 zetas_times_2_32[N] = {
+ -4186625, 25847, -2608894, -518909, 237124, -777960, -876248,
+ 466468, 1826347, 2353451, -359251, -2091905, 3119733, -2884855,
+ 3111497, 2680103, 2725464, 1024112, -1079900, 3585928, -549488,
+ -1119584, 2619752, -2108549, -2118186, -3859737, -1399561, -3277672,
+ 1757237, -19422, 4010497, 280005, 2706023, 95776, 3077325,
+ 3530437, -1661693, -3592148, -2537516, 3915439, -3861115, -3043716,
+ 3574422, -2867647, 3539968, -300467, 2348700, -539299, -1699267,
+ -1643818, 3505694, -3821735, 3507263, -2140649, -1600420, 3699596,
+ 811944, 531354, 954230, 3881043, 3900724, -2556880, 2071892,
+ -2797779, -3930395, -1528703, -3677745, -3041255, -1452451, 3475950,
+ 2176455, -1585221, -1257611, 1939314, -4083598, -1000202, -3190144,
+ -3157330, -3632928, 126922, 3412210, -983419, 2147896, 2715295,
+ -2967645, -3693493, -411027, -2477047, -671102, -1228525, -22981,
+ -1308169, -381987, 1349076, 1852771, -1430430, -3343383, 264944,
+ 508951, 3097992, 44288, -1100098, 904516, 3958618, -3724342,
+ -8578, 1653064, -3249728, 2389356, -210977, 759969, -1316856,
+ 189548, -3553272, 3159746, -1851402, -2409325, -177440, 1315589,
+ 1341330, 1285669, -1584928, -812732, -1439742, -3019102, -3881060,
+ -3628969, 3839961, 2091667, 3407706, 2316500, 3817976, -3342478,
+ 2244091, -2446433, -3562462, 266997, 2434439, -1235728, 3513181,
+ -3520352, -3759364, -1197226, -3193378, 900702, 1859098, 909542,
+ 819034, 495491, -1613174, -43260, -522500, -655327, -3122442,
+ 2031748, 3207046, -3556995, -525098, -768622, -3595838, 342297,
+ 286988, -2437823, 4108315, 3437287, -3342277, 1735879, 203044,
+ 2842341, 2691481, -2590150, 1265009, 4055324, 1247620, 2486353,
+ 1595974, -3767016, 1250494, 2635921, -3548272, -2994039, 1869119,
+ 1903435, -1050970, -1333058, 1237275, -3318210, -1430225, -451100,
+ 1312455, 3306115, -1962642, -1279661, 1917081, -2546312, -1374803,
+ 1500165, 777191, 2235880, 3406031, -542412, -2831860, -1671176,
+ -1846953, -2584293, -3724270, 594136, -3776993, -2013608, 2432395,
+ 2454455, -164721, 1957272, 3369112, 185531, -1207385, -3183426,
+ 162844, 1616392, 3014001, 810149, 1652634, -3694233, -1799107,
+ -3038916, 3523897, 3866901, 269760, 2213111, -975884, 1717735,
+ 472078, -426683, 1723600, -1803090, 1910376, -1667432, -1104333,
+ -260646, -3833893, -2939036, -2235985, -420899, -2286327, 183443,
+ -976891, 1612842, -3545687, -554416, 3919660, -48306, -1362209,
+ 3937738, 1400424, -846154, 1976782
+};
+
+/* Reference: FIPS 204 Section 4 "Parameter Sets" */
+static const struct mldsa_parameter_set {
+ u8 k; /* num rows in the matrix A */
+ u8 l; /* num columns in the matrix A */
+ u8 ctilde_len; /* length of commitment hash ctilde in bytes; lambda/4 */
+ u8 omega; /* max num of 1's in the hint vector h */
+ u8 tau; /* num of +-1's in challenge c */
+ u8 beta; /* tau times eta */
+ u16 pk_len; /* length of public keys in bytes */
+ u16 sig_len; /* length of signatures in bytes */
+ s32 gamma1; /* coefficient range of y */
+} mldsa_parameter_sets[] = {
+ [MLDSA44] = {
+ .k = 4,
+ .l = 4,
+ .ctilde_len = 32,
+ .omega = 80,
+ .tau = 39,
+ .beta = 78,
+ .pk_len = MLDSA44_PUBLIC_KEY_SIZE,
+ .sig_len = MLDSA44_SIGNATURE_SIZE,
+ .gamma1 = 1 << 17,
+ },
+ [MLDSA65] = {
+ .k = 6,
+ .l = 5,
+ .ctilde_len = 48,
+ .omega = 55,
+ .tau = 49,
+ .beta = 196,
+ .pk_len = MLDSA65_PUBLIC_KEY_SIZE,
+ .sig_len = MLDSA65_SIGNATURE_SIZE,
+ .gamma1 = 1 << 19,
+ },
+ [MLDSA87] = {
+ .k = 8,
+ .l = 7,
+ .ctilde_len = 64,
+ .omega = 75,
+ .tau = 60,
+ .beta = 120,
+ .pk_len = MLDSA87_PUBLIC_KEY_SIZE,
+ .sig_len = MLDSA87_SIGNATURE_SIZE,
+ .gamma1 = 1 << 19,
+ },
+};
+
+/* An element of the ring R_q (normal form) or the ring T_q (NTT form) */
+struct mldsa_ring_elem {
+ s32 x[N];
+};
+
+struct mldsa_verification_workspace {
+ /* SHAKE context for computing c, mu, and ctildeprime */
+ struct shake_ctx shake;
+ /* The fields in this union are used in their order of declaration. */
+ union {
+ /* The hash of the public key */
+ u8 tr[64];
+ /* The message representative mu */
+ u8 mu[64];
+ /* Encoded element of w'_1. Real length is either 128 or 192 */
+ u8 w1_encoded[192];
+ /* The commitment hash. Real length is params->ctilde_len */
+ u8 ctildeprime[64];
+ };
+ /* SHAKE context for generating the matrix A */
+ struct shake_ctx a_shake;
+ /*
+ * An element of the matrix A generated from the public seed, or an
+ * element of the vector t_1 decoded from the public key and pre-scaled
+ * by 2^d. Both are in NTT form. To reduce memory usage, we generate
+ * or decode these elements only as needed.
+ */
+ union {
+ struct mldsa_ring_elem a;
+ struct mldsa_ring_elem t1_scaled;
+ };
+ /* The challenge c, generated from ctilde */
+ struct mldsa_ring_elem c;
+ /* A temporary element used during calculations */
+ struct mldsa_ring_elem tmp;
+
+ /* The following fields are variable-length. */
+
+ /* The signer's response vector */
+ struct mldsa_ring_elem z[/* l */];
+
+ /* The signer's hint vector */
+ /* u8 h[k][N]; */
+};
+
+/*
+ * Compute a * b * 2^-32 mod q. a * b must be in the range [-2^31 * q, 2^31 *
+ * q) before reduction. This uses Montgomery reduction with R=2^32 and produces
+ * a product in the range (-q, q), i.e. almost fully reduced but not quite.
+ */
+static inline s32 Zq_mult(s32 a, s32 b)
+{
+ /* Compute the unreduced product c. */
+ s64 c = (s64)a * b;
+
+ /* Compute d = (c mod 2^32) * (q^-1 mod 2^32). */
+ s32 d = (s32)c * QINV_MOD_R;
+
+ /*
+ * Compute e = c - d * q. This makes the low 32 bits zero, since
+ * c - (c * q^-1) * q mod 2^32
+ * = c - c * (q^-1 * q) mod 2^32
+ * = c - c * 1 mod 2^32
+ * = c - c mod 2^32
+ * = 0 mod 2^32
+ */
+ s64 e = c - (s64)d * Q;
+
+ /* Finally, return e * 2^-32. */
+ return e >> 32;
+}
+
+/*
+ * Convert @w to its number-theoretically-transformed representation in-place.
+ * Reference: FIPS 204 Algorithm 41, NTT
+ *
+ * To prevent overflows of intermediate values, all input values should be in
+ * the range (-q, q). The output values are in the range [-9*(q-1), 9*(q-1)].
+ */
+static void ntt(struct mldsa_ring_elem *w)
+{
+ for (int m = 0, len = 128; len >= 1; len /= 2) {
+ for (int start = 0; start < 256; start += 2 * len) {
+ const s32 z = zetas_times_2_32[++m];
+
+ for (int j = start; j < start + len; j++) {
+ s32 t = Zq_mult(z, w->x[j + len]);
+
+ w->x[j + len] = w->x[j] - t;
+ w->x[j] += t;
+ }
+ }
+ }
+}
+
+/*
+ * Convert @w from its number-theoretically-transformed representation in-place.
+ * Reference: FIPS 204 Algorithm 42, NTT^-1
+ *
+ * This also multiplies the values by 2^32, undoing an extra factor of 2^-32
+ * introduced earlier.
+ *
+ * To prevent overflows of intermediate values, all input values should be in
+ * the range (-q, q). The output values are in the range (-q, q) as well.
+ */
+static void invntt_and_mul_2_32(struct mldsa_ring_elem *w)
+{
+ for (int m = 256, len = 1; len < 256; len *= 2) {
+ for (int start = 0; start < 256; start += 2 * len) {
+ const s32 z = -zetas_times_2_32[--m];
+
+ for (int j = start; j < start + len; j++) {
+ s32 t = w->x[j];
+
+ w->x[j] = t + w->x[j + len];
+ w->x[j + len] = Zq_mult(z, t - w->x[j + len]);
+ }
+ }
+ }
+ /*
+ * Multiply by 2^32 * 256^-1. 2^32 cancels the factor of 2^-32 from
+ * earlier Montgomery multiplications. 256^-1 is for NTT^-1. This
+ * itself uses Montgomery multiplication, so *another* 2^32 is needed.
+ * Thus the actual multiplicand is 2^32 * 2^32 * 256^-1 mod q = 41978.
+ */
+ for (int j = 0; j < 256; j++)
+ w->x[j] = Zq_mult(w->x[j], 41978);
+}
+
+/*
+ * Decode an element of t_1, the high d bits of t = As_1 + s_2.
+ * Multiply it by 2^d and convert it to NTT form.
+ */
+static const u8 *decode_t1_elem(struct mldsa_ring_elem *out,
+ const u8 *t1_encoded)
+{
+ for (int j = 0; j < N; j += 4, t1_encoded += 5) {
+ u32 v = get_unaligned_le32(t1_encoded);
+
+ out->x[j + 0] = ((v >> 0) & 0x3ff) << D;
+ out->x[j + 1] = ((v >> 10) & 0x3ff) << D;
+ out->x[j + 2] = ((v >> 20) & 0x3ff) << D;
+ out->x[j + 3] = ((v >> 30) | (t1_encoded[4] << 2)) << D;
+ static_assert(0x3ff << D < Q); /* All values < q. */
+ }
+ ntt(out);
+ return t1_encoded; /* Return updated pointer. */
+}
+
+/*
+ * Decode the signer's response vector 'z' from the signature.
+ * Reference: FIPS 204 Algorithm 27, sigDecode.
+ *
+ * This also validates that the coefficients of z are in range, corresponding
+ * to the infinity norm check at the end of Algorithm 8, ML-DSA.Verify_internal.
+ *
+ * Finally, this also converts z to NTT form.
+ */
+static bool decode_z(struct mldsa_ring_elem z[/* l */], int l, s32 gamma1,
+ int beta, const u8 **sig_ptr)
+{
+ const u8 *sig = *sig_ptr;
+
+ for (int i = 0; i < l; i++) {
+ if (l == 4) { /* ML-DSA-44? */
+ /* 18-bit coefficients: decode 4 from 9 bytes. */
+ for (int j = 0; j < N; j += 4, sig += 9) {
+ u64 v = get_unaligned_le64(sig);
+
+ z[i].x[j + 0] = (v >> 0) & 0x3ffff;
+ z[i].x[j + 1] = (v >> 18) & 0x3ffff;
+ z[i].x[j + 2] = (v >> 36) & 0x3ffff;
+ z[i].x[j + 3] = (v >> 54) | (sig[8] << 10);
+ }
+ } else {
+ /* 20-bit coefficients: decode 4 from 10 bytes. */
+ for (int j = 0; j < N; j += 4, sig += 10) {
+ u64 v = get_unaligned_le64(sig);
+
+ z[i].x[j + 0] = (v >> 0) & 0xfffff;
+ z[i].x[j + 1] = (v >> 20) & 0xfffff;
+ z[i].x[j + 2] = (v >> 40) & 0xfffff;
+ z[i].x[j + 3] =
+ (v >> 60) |
+ (get_unaligned_le16(&sig[8]) << 4);
+ }
+ }
+ for (int j = 0; j < N; j++) {
+ z[i].x[j] = gamma1 - z[i].x[j];
+ if (z[i].x[j] <= -(gamma1 - beta) ||
+ z[i].x[j] >= gamma1 - beta)
+ return false;
+ }
+ ntt(&z[i]);
+ }
+ *sig_ptr = sig; /* Return updated pointer. */
+ return true;
+}
+
+/*
+ * Decode the hint vector 'h' from the signature. It should have at most omega
+ * 1's. Reference: FIPS 204 Algorithm 21, HintBitUnpack
+ */
+static bool decode_hint_vector(u8 h[/* k */][N], int k, int omega, const u8 *y)
+{
+ int index = 0;
+
+ memset(h, 0, k * N);
+ for (int i = 0; i < k; i++) {
+ int count = y[omega + i]; /* num 1's in elems 0 through i */
+ int prev = -1;
+
+ /* Cumulative count mustn't decrease or exceed omega. */
+ if (count < index || count > omega)
+ return false;
+ for (; index < count; index++) {
+ if (y[index] <= prev) /* Coefficients out of order? */
+ return false;
+ prev = y[index];
+ h[i][y[index]] = 1;
+ }
+ }
+ return mem_is_zero(&y[index], omega - index);
+}
+
+/*
+ * Use @seed to generate a ring element @c with coefficients in {-1, 0, 1},
+ * exactly @tau of them nonzero. Reference: FIPS 204 Algorithm 29, SampleInBall
+ */
+static void sample_in_ball(struct mldsa_ring_elem *c, const u8 *seed,
+ size_t seed_len, int tau, struct shake_ctx *shake)
+{
+ u64 signs;
+ u8 j;
+
+ shake256_init(shake);
+ shake_update(shake, seed, seed_len);
+ shake_squeeze(shake, (u8 *)&signs, sizeof(signs));
+ le64_to_cpus(&signs);
+ *c = (struct mldsa_ring_elem){};
+ for (int i = N - tau; i < N; i++, signs >>= 1) {
+ do {
+ shake_squeeze(shake, &j, 1);
+ } while (j > i);
+ c->x[i] = c->x[j];
+ c->x[j] = 1 - 2 * (s32)(signs & 1);
+ }
+}
+
+/*
+ * Expand the public seed @rho and @row_and_column into an element of T_q @out.
+ * Reference: FIPS 204 Algorithm 30, RejNTTPoly
+ */
+static void rej_ntt_poly(struct mldsa_ring_elem *out, const u8 rho[RHO_LEN],
+ __le16 row_and_column, struct shake_ctx *shake)
+{
+ u8 block[SHAKE128_BLOCK_SIZE + 1]; /* 1 extra to allow 4-byte loads */
+
+ shake128_init(shake);
+ shake_update(shake, rho, RHO_LEN);
+ shake_update(shake, (u8 *)&row_and_column, sizeof(row_and_column));
+ for (int i = 0; i < N;) {
+ shake_squeeze(shake, block, SHAKE128_BLOCK_SIZE);
+ static_assert(SHAKE128_BLOCK_SIZE % 3 == 0);
+ for (int j = 0; j < SHAKE128_BLOCK_SIZE && i < N; j += 3) {
+ u32 x = get_unaligned_le32(&block[j]) & 0x7fffff;
+
+ if (x < Q) /* Ignore values >= q. */
+ out->x[i++] = x;
+ }
+ }
+}
+
+/*
+ * Return the high bits of r adjusted according to hint h.
+ * Reference: FIPS 204 Algorithm 40, UseHint
+ */
+static s32 use_hint(u8 h, s32 r, int k)
+{
+ s32 r0, r1;
+
+ r1 = (r + 127) >> 7;
+ if (k == 4) { /* ML-DSA-44? */
+ const s32 gamma2 = (Q - 1) / 88;
+ const s32 m = 44; /* (q - 1) / (2 * gamma2) */
+
+ /* Algorithm 36, Decompose; specialized for gamma2 = (q-1)/88 */
+ /* Formula borrowed from the reference implementation */
+ r1 = (r1 * 11275 + (1 << 23)) >> 24;
+ r1 ^= ((m - 1 - r1) >> 31) & r1;
+ r0 = r - r1 * 2 * gamma2;
+ r0 -= (((Q - 1) / 2 - r0) >> 31) & Q;
+
+ if (h == 0)
+ return r1;
+ if (r0 > 0)
+ return (r1 == m - 1) ? 0 : r1 + 1;
+ else
+ return (r1 == 0) ? m - 1 : r1 - 1;
+ } else {
+ const s32 gamma2 = (Q - 1) / 32;
+ const s32 m = 16; /* (q - 1) / (2 * gamma2) */
+
+ /* Algorithm 36, Decompose; specialized for gamma2 = (q-1)/32 */
+ /* Formula borrowed from the reference implementation */
+ r1 = ((r1 * 1025 + (1 << 21)) >> 22) & (m - 1);
+ r0 = r - r1 * 2 * gamma2;
+ r0 -= (((Q - 1) / 2 - r0) >> 31) & Q;
+
+ if (h == 0)
+ return r1;
+ if (r0 > 0)
+ return (r1 + 1) & (m - 1);
+ else
+ return (r1 - 1) & (m - 1);
+ }
+}
+
+/* Reference: FIPS 204 Section 6.3, "ML-DSA Verifying (Internal)" */
+int mldsa_verify(enum mldsa_alg alg, const u8 *sig, size_t sig_len,
+ const u8 *msg, size_t msg_len, const u8 *pk, size_t pk_len)
+{
+ const struct mldsa_parameter_set *params = &mldsa_parameter_sets[alg];
+ const int k = params->k, l = params->l;
+ /* For now, this implementation doesn't support nonempty contexts. */
+ static const u8 msg_prefix[2] = { /* dom_sep= */ 0, /* ctx_len= */ 0 };
+ const u8 *ctilde; /* The signer's commitment hash */
+ const u8 *t1_encoded = &pk[RHO_LEN]; /* Next encoded element of t_1 */
+ u8 (*h)[N]; /* The signer's hint vector, length k */
+ int w1_pos;
+
+ /* Validate the public key and signature sizes. */
+ if (pk_len != params->pk_len || sig_len != params->sig_len)
+ return -EBADMSG;
+
+ /* Allocate the workspace, including variable-length fields. */
+ /* Size depends only on the ML-DSA parameter set, not the inputs. */
+ struct mldsa_verification_workspace *ws __free(kfree) = kmalloc(
+ sizeof(*ws) + (l * sizeof(ws->z[0])) + (k * N), GFP_KERNEL);
+ if (!ws)
+ return -ENOMEM;
+ h = (u8 (*)[N])&ws->z[l];
+
+ /* Decode the signature. Reference: FIPS 204 Algorithm 27, sigDecode */
+ ctilde = sig;
+ sig += params->ctilde_len;
+ if (!decode_z(ws->z, l, params->gamma1, params->beta, &sig))
+ return -EBADMSG;
+ if (!decode_hint_vector(h, k, params->omega, sig))
+ return -EBADMSG;
+
+ /* Recreate the challenge c from the signer's commitment hash. */
+ sample_in_ball(&ws->c, ctilde, params->ctilde_len, params->tau,
+ &ws->shake);
+ ntt(&ws->c);
+
+ /* Compute the message representative mu. */
+ shake256(pk, pk_len, ws->tr, sizeof(ws->tr));
+ shake256_init(&ws->shake);
+ shake_update(&ws->shake, ws->tr, sizeof(ws->tr));
+ shake_update(&ws->shake, msg_prefix, sizeof(msg_prefix));
+ shake_update(&ws->shake, msg, msg_len);
+ shake_squeeze(&ws->shake, ws->mu, sizeof(ws->mu));
+
+ /* Start computing ctildeprime = H(mu || w1Encode(w'_1)). */
+ shake256_init(&ws->shake);
+ shake_update(&ws->shake, ws->mu, sizeof(ws->mu));
+
+ /*
+ * Compute the commitment w'_1 from A, z, c, t_1, and h.
+ *
+ * The computation is the same for each of the k rows. Just do each row
+ * before moving on to the next, resulting in only one loop over k.
+ */
+ for (int i = 0; i < k; i++) {
+ /*
+ * tmp = NTT(A) * NTT(z) * 2^-32
+ * To reduce memory use, generate each element of A on-demand.
+ * Note that each element of A is used only once.
+ */
+ ws->tmp = (struct mldsa_ring_elem){};
+ for (int j = 0; j < l; j++) {
+ rej_ntt_poly(&ws->a, /* rho is first field of pk */ pk,
+ cpu_to_le16((i << 8) | j), &ws->a_shake);
+ for (int n = 0; n < N; n++)
+ ws->tmp.x[n] +=
+ Zq_mult(ws->a.x[n], ws->z[j].x[n]);
+ }
+ /* Coefficients of tmp now have abs value <= l*(q-1). */
+
+ /* Decode the next element of t_1. */
+ t1_encoded = decode_t1_elem(&ws->t1_scaled, t1_encoded);
+
+ /*
+ * tmp -= NTT(c) * NTT(t_1 * 2^d) * 2^-32
+ *
+ * Taking a conservative bound for the output of ntt(), the
+ * multiplicands can have coefficients with absolute value up to
+ * 9*q. That corresponds to a product with absolute value 81*q.
+ * That is within the limits of Zq_mult() which needs < ~256*q.
+ */
+ for (int j = 0; j < N; j++)
+ ws->tmp.x[j] -= Zq_mult(ws->c.x[j], ws->t1_scaled.x[j]);
+
+ /*
+ * Coefficients of tmp now have abs value <= (l+1)*(q-1).
+ * To safely do the inverse NTT, reduce them to abs value < q.
+ */
+ for (int j = 0; j < N; j++)
+ ws->tmp.x[j] = Zq_mult(ws->tmp.x[j], R_MOD_Q);
+
+ /* tmp = w'_Approx = NTT^-1(tmp) * 2^32 */
+ invntt_and_mul_2_32(&ws->tmp);
+
+ /* Reduce to [0, q), then tmp = w'_1 = UseHint(h, w'_Approx) */
+ for (int j = 0; j < N; j++) {
+ ws->tmp.x[j] += (ws->tmp.x[j] >> 31) & Q;
+ ws->tmp.x[j] = use_hint(h[i][j], ws->tmp.x[j], k);
+ }
+
+ /* w1Encode(w'_1) */
+ w1_pos = 0;
+ if (k == 4) { /* ML-DSA-44? */
+ /* 6 bits per value. Pack 4 at a time. */
+ for (int j = 0; j < N; j += 4) {
+ u32 v = (ws->tmp.x[j + 0] << 0) |
+ (ws->tmp.x[j + 1] << 6) |
+ (ws->tmp.x[j + 2] << 12) |
+ (ws->tmp.x[j + 3] << 18);
+ ws->w1_encoded[w1_pos++] = v >> 0;
+ ws->w1_encoded[w1_pos++] = v >> 8;
+ ws->w1_encoded[w1_pos++] = v >> 16;
+ }
+ } else {
+ /* 4 bits per value. Pack 2 at a time. */
+ for (int j = 0; j < N; j += 2)
+ ws->w1_encoded[w1_pos++] =
+ ws->tmp.x[j] | (ws->tmp.x[j + 1] << 4);
+ }
+ /* Update the hash with w1Encode(w'_1). */
+ shake_update(&ws->shake, ws->w1_encoded, w1_pos);
+ }
+
+ /* Finish computing ctildeprime. */
+ shake_squeeze(&ws->shake, ws->ctildeprime, params->ctilde_len);
+
+ /* Verify that ctilde == ctildeprime. */
+ if (memcmp(ws->ctildeprime, ctilde, params->ctilde_len) != 0)
+ return -EBADMSG;
+ /* ||z||_infinity < gamma1 - beta was already checked in decode_z(). */
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mldsa_verify);
--
2.51.2
It looks like this may be close, but for the record:

The LF has a dedicated project for ML-DSA:
https://github.com/pq-code-package/mldsa-native (part of the Post-Quantum
Cryptography Alliance). It's derived from the reference implementation and
adds automatically verified memory-safety + type-safety (= bounds-tracking)
and a uniform backend interface for assembly optimizations; see the README
for more details. It's licensed under Apache-2.0 OR MIT OR ISC.

If you are sure that the kernel will never need sign/keygen support, or
support for optimized assembly, the current ad-hoc patch may be fine.
Otherwise, the challenges are likely just delayed, e.g. how to safely
re-use parts of the current code for the timing-sensitive signing, or in
contexts with other bounds assumptions, or how to integrate assembly
optimizations. It may not seem so, but this is difficult to get right and
where maintainability gets challenging.

Verification here is a vehicle for maintainability: If you change any
arithmetic code -- say you decide to do less modular reduction for
performance -- you currently need very careful review that the bounds
still check out in the worst case. In mldsa-native, this is re-checked
automatically.

mldsa-native is production-ready and in the process of being integrated
into Amazon's AWS-LC crypto library; the sibling-project mlkem-native
https://github.com/pq-code-package/mlkem-native already has been.
mldsa-native is not yet a drop-in for the kernel, however. At the least,
memory usage needs to be brought down and allocation be made flexible.
We're working on it, and if the kernel community was interested in it,
it'd give impetus to accelerate the work.

This is just so you're aware. If mldsa-native is of interest, let us
know -- it would be great to collaborate across the LF instead of
duplicating efforts.

Thanks,
Hanno & Matthias (maintainers of mldsa-native)
On Sat, Nov 29, 2025 at 08:00:17PM +0000, Becker, Hanno wrote:
> It looks like this may be close, but for the record:
>
> The LF has a dedicated project for ML-DSA:
> https://github.com/pq-code-package/mldsa-native (part of the Post-Quantum
> Cryptography Alliance). It's derived from the reference implementation and
> adds automatically verified memory-safety + type-safety (= bounds-tracking)
> and a uniform backend interface for assembly optimizations; see the README
> for more details. It's licensed under Apache-2.0 OR MIT OR ISC.
>
> If you are sure that the kernel will never need sign/keygen support, or
> support for optimized assembly, the current ad-hoc patch may be fine.
> Otherwise, the challenges are likely just delayed, e.g. how to safely
> re-use parts of the current code for the timing-sensitive signing, or in
> contexts with other bounds assumptions, or how to integrate assembly
> optimizations. It may not seem so, but this is difficult to get right and
> where maintainability gets challenging.
>
> Verification here is a vehicle for maintainability: If you change any
> arithmetic code -- say you decide to do less modular reduction for
> performance -- you currently need very careful review that the bounds
> still check out in the worst case. In mldsa-native, this is re-checked
> automatically.
>
> mldsa-native is production-ready and in the process of being integrated
> into Amazon's AWS-LC crypto library; the sibling-project mlkem-native
> https://github.com/pq-code-package/mlkem-native already has been.
> mldsa-native is not yet a drop-in for the kernel, however. At the least,
> memory usage needs to be brought down and allocation be made flexible.
> We're working on it, and if the kernel community was interested in it,
> it'd give impetus to accelerate the work.
>
> This is just so you're aware. If mldsa-native is of interest, let us
> know -- it would be great to collaborate across the LF instead of
> duplicating efforts.
>
> Thanks,
> Hanno & Matthias (maintainers of mldsa-native)

(Side note: this patch series is up to v2. See
https://lore.kernel.org/linux-crypto/20251126203517.167040-1-ebiggers@kernel.org/
for the latest version as of this writing)

For context, this is at least the third different userspace project that's
been suggested to borrow ML-DSA code from, and not the first that is a fork
of the Dilithium reference code. ML-DSA is also just one of dozens of
algorithms the kernel supports. In none of them has the kernel community
been successful with integrating a project wholesale, vs. just taking
individual files. So while mldsa-native looks like a great project, for
the task in question (adding basic ML-DSA verification support to the
kernel) I'm not sure it brings much new to the table.

Of course, there's also no corresponding kernel patch that proposes
integrating mldsa-native into the kernel, so it's a bit hypothetical at
this point too. The leancrypto proposal at least had a patch, so it was
more concrete.

I think you may be underestimating how much the requirements of the
kernel differ from userspace. Consider the following:

- Kernel stack is 8 KB to 16 KB. mldsa-native's signature verification
  code starts out by allocating ~100KB of memory on the stack. If that
  code was built into the kernel, it would immediately write out of
  bounds. Oops. So much for the formal verification of memory bounds.

- Vector registers (e.g. AVX) can be used in the kernel only in some
  contexts, and only when they are explicitly saved and restored. So
  we have to do our own integration of any code that uses them anyway.
  There is also more overhead to each vector-optimized function than
  there is in userspace, so very fine-grained optimization (e.g. as is
  used in the Dilithium reference code) doesn't work too well.

- The vector intrinsics like <immintrin.h> can't be used in the
  kernel, as they depend on userspace headers. Thus, vector
  instructions can generally be used only in assembly code. I believe
  this problem is solvable with a combination of changes to GCC, clang,
  and the kernel, and I'd like to see that happen. But someone would
  need to do it.

Note that the kernel already has optimized Keccak code. That already
covers the most performance-critical part of ML-DSA. Besides that part,
I think we're fine with a portable implementation of ML-DSA. Consider
that that's always been what we've done for RSA, for example. Signature
verification performance just isn't that important in the kernel.

But even if we decide the kernel needs optimized ML-DSA ring operations
later, I don't think we get any free lunch. Userspace libraries aren't
directly usable in the kernel anyway, for the reasons I outlined above.
And we can always borrow things piecemeal, as we've always done.

Microbenchmark throughput also isn't everything: memory usage and code
size is very important too, often even more important. I haven't seen a
proposal that even comes close to my mldsa_verify() on those metrics.

We can't be 100% sure that the kernel will never need ML-DSA signing
support. But it's not needed now, it's something that architecturally
doesn't make much sense, and we'd prefer to avoid adding it. We
shouldn't overengineer things around requirements that don't exist.

Anyway, we also aren't stuck with one implementation forever. If someone
can actually do ML-DSA better, whether that's for verification-only
right now or for everything during a hypothetical future addition of
signing support, we can replace my lib/crypto/mldsa.c with something
else. *Usually* kernel code evolves incrementally, but not always.
Especially with the crypto algorithms, there are examples where we've
entirely swapped out an implementation.

- Eric
Hi Hanno,

Just to add to what Eric said...

On Sat, Nov 29, 2025 at 04:19:11PM -0800, Eric Biggers wrote:
> I think you may be underestimating how much the requirements of the
> kernel differ from userspace. Consider the following:

I've added a bit of formally verified code to the kernel, and also ported
some userspace crypto. In these cases, I wound up working with the authors
of the code to make it more suitable to the requirements of kernel space --
even down to the formatting level. For example, the HACL* project needed
some changes to KReMLin to make the variety of code fit into what the
kernel expected. Andy Polyakov's code needed some internal functions
exposed so that the kernel could do cpu capability based dispatch. And so
on and so forth. There's always _something_.

I'd love to have a formally verified ML-DSA implementation (if we're to
have ML-DSA in the kernel anyhow, but it looks like that's happening). But
I almost guarantee that it's going to be some work to do. If those are
efforts you'd consider undertaking seriously, I'd be happy to assist or
help guide the considerations.

There's also another approach, which would be to formally verify Eric's
code, perhaps even using the same techniques as your own project, via CBMC
and such. In this case, the name of the game is usually to port the kernel
code to userspace. That generally winds up being a matter of shimming out
some headers and adding a few typedefs. There's a decent amount of kernel
test code or kernel tool code that does this, and lots of shim headers
already in the kernel that can be borrowed for this. But usually, at least
for crypto code, you can figure it out pretty quickly by just trying to
compile it and plugging the missing headers and types as they come up. The
model checking might be more work with this latter approach, since it's not
already done like it is for the former, but the porting work is probably
much less arduous.

Anyway, the bigger picture is that I'm very enthusiastic about getting
formally verified crypto in the kernel, so these types of efforts are
really very appreciated and welcomed. But it just takes a bit more work
than usual.

Jason
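As a rough illustration of what such a shim usually amounts to
(hypothetical file; the exact set of typedefs and stubs depends on what
the code under test actually pulls in):

	/* shim/linux/types.h -- hypothetical userspace stand-in (sketch) */
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	typedef uint8_t  u8;
	typedef uint16_t u16;
	typedef uint32_t u32;
	typedef uint64_t u64;
	typedef int32_t  s32;
	typedef int64_t  s64;
	typedef uint16_t __le16;

	/*
	 * Similar one-file stand-ins would cover the remaining includes,
	 * e.g. mapping kmalloc()/kfree() to malloc()/free(), added as the
	 * compiler reports what's missing.
	 */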
Eric, Jason,

Thanks for the fast replies!

On 30/11/2025, 00:22, "Eric Biggers" <ebiggers@kernel.org <mailto:ebiggers@kernel.org>> wrote:
> I think you may be underestimating how much the requirements of the
> kernel differ from userspace.

There is no doubt this is the case -- I am not a kernel guy -- so the
points you raise are very valuable. Equally, you may be underestimating
how much work it is to go from a static verification-only code to
something that the community will be able to work with and extend in the
future. There's clearly opportunity to learn from each other here. If
this patch forms the 'mldsa-v1' for the kernel, it would be great to work
together to see if 'mldsa-v2' could come from mldsa-native.

> In none of them has the kernel community been successful with
> integrating a project wholesale, vs. just taking individual files.

I take that as a challenge. With AWS-LC we were also told that
mlkem-native would never be able to integrate wholesale -- and now it is.
It's a matter of goodwill and collaboration, and not a binary yes/no -- if
selected but minimal patches are needed, that's still better than an
entirely separate implementation, in my mind.

> - Kernel stack is 8 KB to 16 KB. ...

Yes, as mentioned we started working on a) bringing the memory usage down,
and b) making the use of heap/stack configurable.

> - Vector registers (e.g. AVX) can be used in the kernel only in some
>   contexts, and only when they are explicitly saved and restored. So
>   we have to do our own integration of any code that uses them anyway.
>   There is also more overhead to each vector-optimized function than
>   there is in userspace, so very fine-grained optimization (e.g. as is
>   used in the Dilithium reference code) doesn't work too well.

That's very useful, can you say more? Would one want some sort of
configurable preamble/postamble in the top-level API which takes care of
the necessary save/restore logic?

What is the per-function overhead?

> - The vector intrinsics like <immintrin.h> can't be used in the
>   kernel, as they depend on userspace headers. Thus, vector
>   instructions can generally be used only in assembly code. I believe
>   this problem is solvable with a combination of changes to GCC, clang,
>   and the kernel, and I'd like to see that happen. But someone would
>   need to do it.

The use of intrinsics is on the way out; the kernel isn't the only project
who can't use them. Using assembly is also more suitable for our
optimization and verification approach in mlkem-native and mldsa-native:
We superoptimize some assembly using SLOTHY
(https://github.com/slothy-optimizer/slothy/) and then do 'post-hoc'
verification of the final object code using the HOL-Light/s2n-bignum
(https://github.com/awslabs/s2n-bignum/) infrastructure. In mlkem-native,
all AArch64 assembly is developed and verified in this way; in
mldsa-native, we just completed the verification of the AVX2 assembly for
the base multiplication and the NTT.

> Note that the kernel already has optimized Keccak code. That already
> covers the most performance-critical part of ML-DSA.

No, this would need _batched_ Keccak. An ML-DSA implementation using only
1x-Keccak will never have competitive performance. See
https://github.com/pq-code-package/mldsa-native/pull/754 for the
performance loss from using unbatched Keccak only, on a variety of
platforms; it's >2x for some.

In turn, if you want to integrate batched Keccak -- but perhaps only on
some platforms? -- you need to rewrite your entire code to make use of it.
That's not a simple change, and part of what I mean when I say that the
challenges are just deferred. Note that the official reference and AVX2
implementations duck this problem by duplicating the code and adjusting
it, rather than looking for a common structure that could host both
'plain' and batched Keccak. I assume the amount of code duplication this
brings would be unacceptable.

On 30/11/2025, 01:06, "Jason A. Donenfeld" <Jason@zx2c4.com <mailto:Jason@zx2c4.com>> wrote:
> I've added a bit of formally verified code to the kernel, and also
> ported some userspace crypto. In these cases, I wound up working with
> the authors of the code to make it more suitable to the requirements
> of kernel space -- even down to the formatting level. For example, the
> HACL* project needed some changes to KReMLin to make the variety of
> code fit into what the kernel expected. Andy Polyakov's code needed
> some internal functions exposed so that the kernel could do cpu
> capability based dispatch. And so on and so forth. There's always
> _something_.

100%. This is where we need support from someone in the kernel to even
know what needs doing. The caveat regarding SIMD usage Eric mentioned is a
good example. The CPU capability based dispatch, for example, was
something we flushed out when we did the AWS-LC integration: dispatch is
now configurable.

> If those are efforts you'd consider undertaking seriously, I'd be
> happy to assist or help guide the considerations.

We are taking mlkem/mldsa-native seriously and want to make them as usable
as possible. So, regardless of whether they'd ultimately end up in the
kernel, any support of the form "If you wanted to integrate this in
environment XXX [like the kernel], then you would need ..." is very useful
and we'd be grateful for it. I don't expect this to be something we can
rush through in a couple of days, but something that's achieved with
steady progress and collaboration.

> Anyway, the bigger picture is that I'm very enthusiastic about getting
> formally verified crypto in the kernel, so these types of efforts are
> really very appreciated and welcomed. But it just takes a bit more
> work than usual.

Thank you, Jason, this is great to hear, and if you had time to work with
us, we'd really appreciate it.

Thanks,
Hanno & Matthias
On Sun, Nov 30, 2025 at 07:15:22AM +0000, Becker, Hanno wrote:
> > - Vector registers (e.g. AVX) can be used in the kernel only in some
> > contexts, and only when they are explicitly saved and restored. So
> > we have to do our own integration of any code that uses them anyway.
> > There is also more overhead to each vector-optimized function than
> > there is in userspace, so very fine-grained optimization (e.g. as is
> > used in the Dilithium reference code) doesn't work too well.
>
> That's very useful, can you say more? Would one want some sort of
> configurable preamble/postamble in the top-level API which takes care of
> the necessary save/restore logic?
>
> What is the per-function overhead?
It varies by architecture, but usually it looks something like:
if (irq_fpu_usable()) {
kernel_fpu_begin();
avx_function();
kernel_fpu_end();
} else {
generic_function();
}
The overhead varies significantly by CPU, kernel config options, and
whether it's the first use since the current task last entered the
kernel. But it can be up to a few hundred cycles.
> > Note that the kernel already has optimized Keccak code. That already
> > covers the most performance-critical part of ML-DSA.
>
> No, this would need _batched_ Keccak. An ML-DSA implementation using
> only 1x-Keccak will never have competitive performance. See
> https://github.com/pq-code-package/mldsa-native/pull/754 for the
> performance loss from using unbatched Keccak only, on a variety of
> platforms; it's >2x for some.
>
> In turn, if you want to integrate batched Keccak -- but perhaps only on
> some platforms? -- you need to rewrite your entire code to make use of
> it. That's not a simple change, and part of what I mean when I say that
> the challenges are just deferred. Note that the official reference and
> AVX2 implementations duck this problem by duplicating the code and
> adjusting it, rather than looking for a common structure that could host
> both 'plain' and batched Keccak. I assume the amount of code duplication
> this brings would be unacceptable.
At least in my code, only the matrix expansion code would need to change
to take advantage of interleaved Keccak. The fact that other
implementations apparently are having trouble with this actually
suggests to me that perhaps they're not good implementations to use.
Anyway, no one has said they want this particular optimization in the
kernel anyway. And hopefully the future is native Keccak support
anyway; s390 already has it, and (at least) RISC-V is working on it.
- Eric
Eric Biggers <ebiggers@kernel.org> wrote:
> + /* Compute d = (c mod 2^32) * (q^-1 mod 2^32). */
> + s32 d = (s32)c * QINV_MOD_R;
Hmmm... is "(s32)c" actually "(c mod 2^32)"? Should that be:
u32 d = (u32)c * QINV_MOD_R;
This is followed up by casting 'd' to "s64". I don't think that should
sign-extend it, but...
> + for (int m = 0, len = 128; len >= 1; len /= 2) {
Can you put "int m = 0" outside of the for-statement? I know putting it
inside saves a line or two, but 'm' is not the loop counter - which it seems
like it should be by virtue of being listed first.
> + for (int m = 256, len = 1; len < 256; len *= 2) {
Ditto.
> +static const u8 *decode_t1_elem(struct mldsa_ring_elem *out,
> + const u8 *t1_encoded)
I think this is (more or less) pkDecode()? Can you put something like:
* Decode the vector 't1' from the public key.
* Reference: FIPS 204 Algorithm 23, pkDecode.
in the comment before it?
> +/*
> + * Use @seed to generate a ring element @c with coefficients in {-1, 0, 1},
> + * exactly @tau of them nonzero. Reference: FIPS 204 Algorithm 29, SampleInBall
> + */
> +static void sample_in_ball(struct mldsa_ring_elem *c, const u8 *seed,
> + size_t seed_len, int tau, struct shake_ctx *shake)
Should "seed" actually be labelled "rho"? I know a seed is what it is, but
the algo description has a different label - and the caller passes it ctilde,
not rho:-/.
> + u8 (*h)[N]; /* The signer's hint vector, length k */
> + h = (u8 (*)[N])&ws->z[l];
C is weird sometimes.
> + /* Reduce to [0, q), then tmp = w'_1 = UseHint(h, w'_Approx) */
Bracket mismatch. "[0, q]"
> + /* w1Encode(w'_1) */
> + w1_pos = 0;
> ...
Given you put the decode functions into helpers, don't you want to do that
with this?
> + if (memcmp(ws->ctildeprime, ctilde, params->ctilde_len) != 0)
> + return -EBADMSG;
Actually, this should return -EKEYREJECTED, not -EBADMSG.
I guess you don't need to use crypto_memneq() as timing doesn't matter.
The maths look okay, I think. You can add:
Reviewed-by: David Howells <dhowells@redhat.com>
David
On Thu, Nov 20, 2025 at 01:55:18PM +0000, David Howells wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
>
> > + /* Compute d = (c mod 2^32) * (q^-1 mod 2^32). */
> > + s32 d = (s32)c * QINV_MOD_R;
>
> Hmmm... is "(s32)c" actually "(c mod 2^32)"? Should that be:
>
> u32 d = (u32)c * QINV_MOD_R;
>
> This is followed up by casting 'd' to "s64". I don't think that should
> sign-extend it, but...
It selects the representative in the range [INT32_MIN, INT32_MAX],
rather than the representative in the range [0, UINT32_MAX]. The sign
extension is intentional. This makes the reduction more symmetric so
that the range of supported unreduced products is roughly symmetric.
I'll update the comments to clarify this.
> > + for (int m = 0, len = 128; len >= 1; len /= 2) {
>
> Can you put "int m = 0" outside of the for-statement? I know putting it
> inside saves a line or two, but 'm' is not the loop counter - which it seems
> like it should be by virtue of being listed first.
>
> > + for (int m = 256, len = 1; len < 256; len *= 2) {
>
> Ditto.
Sure.
>
> > +static const u8 *decode_t1_elem(struct mldsa_ring_elem *out,
> > + const u8 *t1_encoded)
>
> I think this is (more or less) pkDecode()? Can you put something like:
>
> * Decode the vector 't1' from the public key.
> * Reference: FIPS 204 Algorithm 23, pkDecode.
>
> in the comment before it?
Sure.
> > +/*
> > + * Use @seed to generate a ring element @c with coefficients in {-1, 0, 1},
> > + * exactly @tau of them nonzero. Reference: FIPS 204 Algorithm 29, SampleInBall
> > + */
> > +static void sample_in_ball(struct mldsa_ring_elem *c, const u8 *seed,
> > + size_t seed_len, int tau, struct shake_ctx *shake)
>
> Should "seed" actually be labelled "rho"? I know a seed is what it is, but
> the algo description has a different label - and the caller passes it ctilde,
> not rho:-/.
FIPS 204 Algorithm 29 SampleInBall uses the variable rho for the seed,
while also calling it a "seed" in the descriptive text. However,
elsewhere rho refers specifically to the public key's random seed. I
think just calling it "seed" makes sense here.
> > + u8 (*h)[N]; /* The signer's hint vector, length k */
> > + h = (u8 (*)[N])&ws->z[l];
>
> C is weird sometimes.
We could make it a 'u8 *', but then we'd have to use array indices like
h[i*N + j] rather than h[i][j]. May be worth it anyway, to avoid the
slightly-unusual syntax.
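For comparison, the two layouts name the same bytes (illustration only,
with hypothetical variable names but the patch's ws/z/l/N):

	/* What the patch does: a 2-D view of the k * N bytes after z[]. */
	u8 (*rows)[N] = (u8 (*)[N])&ws->z[l];

	/* The plain-pointer alternative: same memory, flat indexing. */
	u8 *flat = (u8 *)&ws->z[l];

	/* rows[i][j] and flat[i * N + j] refer to the same byte. */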
> > + /* Reduce to [0, q), then tmp = w'_1 = UseHint(h, w'_Approx) */
>
> Bracket mismatch. "[0, q]"
It's intentional, since it denotes a mathematical range. Elsewhere I
used the words "the range" explicitly, so I'll add that above too. (Or
maybe reword it differently.)
>
> > + /* w1Encode(w'_1) */
> > + w1_pos = 0;
> > ...
>
> Given you put the decode functions into helpers, don't you want to do that
> with this?
Sure, I'll move the w1Encode part into a helper function.
> > + if (memcmp(ws->ctildeprime, ctilde, params->ctilde_len) != 0)
> > + return -EBADMSG;
>
> Actually, this should return -EKEYREJECTED, not -EBADMSG.
Who/what decided that? A lot of the crypto code uses -EBADMSG already.
crypto_aead uses it, for example.
> I guess you don't need to use crypto_memneq() as timing doesn't matter.
Correct.
> The maths look okay, I think. You can add:
>
> Reviewed-by: David Howells <dhowells@redhat.com>
Thanks,
- Eric
Eric Biggers <ebiggers@kernel.org> wrote:

> > > + if (memcmp(ws->ctildeprime, ctilde, params->ctilde_len) != 0)
> > > + return -EBADMSG;
> >
> > Actually, this should return -EKEYREJECTED, not -EBADMSG.
>
> Who/what decided that?

I did. When I added RSA support in 2012 for module signing. Note that it
was originally added as part of crypto/asymmetric_keys/ and was not covered
by a crypto API. The RSA code has since been moved to crypto/ and is now
accessed through the crypto API, but it has retained this error code and
this is also used by other public key algos.

> A lot of the crypto code uses -EBADMSG already.
> crypto_aead uses it, for example.

ecdsa.c:60: return -EKEYREJECTED;
ecrdsa.c:111: return -EKEYREJECTED;
ecrdsa.c:139: return -EKEYREJECTED;
ecrdsa.c:239: return -EKEYREJECTED;
rsassa-pkcs1.c:293: return -EKEYREJECTED;
rsassa-pkcs1.c:295: return -EKEYREJECTED;

David
On Fri, Nov 21, 2025 at 09:39:31PM +0000, David Howells wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
>
> > > > + if (memcmp(ws->ctildeprime, ctilde, params->ctilde_len) != 0)
> > > > + return -EBADMSG;
> > >
> > > Actually, this should return -EKEYREJECTED, not -EBADMSG.
> >
> > Who/what decided that?
>
> I did. When I added RSA support in 2012 for module signing. Note that it
> was originally added as part of crypto/asymmetric_keys/ and was not covered by
> a crypto API. The RSA code has since been moved to crypto/ and is now
> accessed through the crypto API, but it has retained this error code and this
> is also used by other public key algos.
>
> > A lot of the crypto code uses -EBADMSG already.
> > crypto_aead uses it, for example.
>
> ecdsa.c:60: return -EKEYREJECTED;
> ecrdsa.c:111: return -EKEYREJECTED;
> ecrdsa.c:139: return -EKEYREJECTED;
> ecrdsa.c:239: return -EKEYREJECTED;
> rsassa-pkcs1.c:293: return -EKEYREJECTED;
> rsassa-pkcs1.c:295: return -EKEYREJECTED;
crypto/aegis128-core.c:442: return -EBADMSG;
crypto/aegis128-core.c:499: return -EBADMSG;
crypto/algif_aead.c:313: if (err == -EIOCBQUEUED || err == -EBADMSG || !ret)
crypto/authenc.c:223: return -EBADMSG;
crypto/authencesn.c:220: return -EBADMSG;
crypto/ccm.c:336: err = -EBADMSG;
crypto/ccm.c:384: return -EBADMSG;
crypto/chacha20poly1305.c:90: return -EBADMSG;
crypto/dh.c:207: ret = -EBADMSG;
crypto/dh.c:221: ret = -EBADMSG;
crypto/dh.c:242: ret = -EBADMSG;
crypto/ecdsa.c:37: return -EBADMSG;
crypto/ecrdsa.c:101: return -EBADMSG;
crypto/gcm.c:471: return crypto_memneq(iauth_tag, auth_tag, authsize) ? -EBADMSG : 0;
crypto/krb5enc.c:259: return -EBADMSG;
crypto/rsa.c:150: ret = -EBADMSG;
crypto/rsa.c:189: ret = -EBADMSG;
crypto/rsassa-pkcs1.c:275: return -EBADMSG;
crypto/rsassa-pkcs1.c:282: return -EBADMSG;
crypto/rsassa-pkcs1.c:286: return -EBADMSG;
crypto/rsassa-pkcs1.c:288: return -EBADMSG;
crypto/testmgr.c:90: * algorithm might result in EINVAL rather than EBADMSG, due to other
crypto/testmgr.c:2179: (err != vec->crypt_error && !(err == -EBADMSG && vec->novrfy))) {
crypto/testmgr.c:2183: vec->crypt_error != 0 && vec->crypt_error != -EBADMSG)
crypto/testmgr.c:2184: sprintf(expected_error, "-EBADMSG or %d",
crypto/testmgr.c:2187: sprintf(expected_error, "-EBADMSG");
include/crypto/aead.h:37: * operation is that the caller should explicitly check for -EBADMSG of the
include/crypto/aead.h:39: * a breach in the integrity of the message. In essence, that -EBADMSG error
include/crypto/aead.h:375: * Return: 0 if the cipher operation was successful; -EBADMSG: The AEAD
That list actually includes the same three files that use -EKEYREJECTED.
It looks like if the signature verification fails "early" it's -EBADMSG,
whereas if it fails "late" it's -EKEYREJECTED? I'm skeptical that
that's a meaningful difference. And it's not like this is documented
either; crypto_sig_verify() just says "error code in case of error".
- Eric
On Fri, Nov 21, 2025 at 10:23:09PM +0000, Eric Biggers wrote:
> That list actually includes the same three files that use -EKEYREJECTED.
> It looks like if the signature verification fails "early" it's -EBADMSG,
> whereas if it fails "late" it's -EKEYREJECTED?

-EBADMSG denotes malformed data (e.g. incorrectly formatted ASN.1 payload).

-EKEYREJECTED denotes a well-formed, but incorrect signature (e.g. made
by a wrong key).

I think it's important and useful to be able to differentiate that.

Thanks,

Lukas
On Fri, Nov 21, 2025 at 11:29:16PM +0100, Lukas Wunner wrote:
> On Fri, Nov 21, 2025 at 10:23:09PM +0000, Eric Biggers wrote:
> > That list actually includes the same three files that use -EKEYREJECTED.
> > It looks like if the signature verification fails "early" it's -EBADMSG,
> > whereas if it fails "late" it's -EKEYREJECTED?
>
> -EBADMSG denotes malformed data (e.g. incorrectly formatted ASN.1 payload).
>
> -EKEYREJECTED denotes a well-formed, but incorrect signature (e.g. made
> by a wrong key).
>
> I think it's important and useful to be able to differentiate that.

I guess. The pseudocode in the ML-DSA specification is clear that
signature verification returns a boolean, regardless of whether the
signature is invalid due to the ctilde check, the coefficients of the
response vector being out of range, or the encoded hint vector being
malformed.

But if we really think it's useful we could disregard that and use
EKEYREJECTED for the ctilde check and EBADMSG for the other cases. I
think that would align with what you're suggesting. This is inconsistent
with the kernel's symmetric crypto code, but oh well.

- Eric
Eric Biggers <ebiggers@kernel.org> wrote:

> On Thu, Nov 20, 2025 at 01:55:18PM +0000, David Howells wrote:
> > Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > > + /* Compute d = (c mod 2^32) * (q^-1 mod 2^32). */
> > > + s32 d = (s32)c * QINV_MOD_R;
> >
> > Hmmm... is "(s32)c" actually "(c mod 2^32)"? Should that be:
> >
> > u32 d = (u32)c * QINV_MOD_R;
> >
> > This is followed up by casting 'd' to "s64". I don't think that should
> > sign-extend it, but...
>
> It selects the representative in the range [INT32_MIN, INT32_MAX],
> rather than the representative in the range [0, UINT32_MAX]. The sign
> extension is intentional.

I'm concerned about the basis on which it becomes positive or negative.
It looks like the sign bit ends up being chosen arbitrarily.

> > > + /* Reduce to [0, q), then tmp = w'_1 = UseHint(h, w'_Approx) */
> >
> > Bracket mismatch. "[0, q]"
>
> It's intentional, since it denotes a mathematical range. Elsewhere I
> used the words "the range" explicitly, so I'll add that above too. (Or
> maybe reword it differently.)

I meant you have an opening square bracket and a closing round bracket in
"[0, q)".

David
On Fri, Nov 21, 2025 at 12:41:41PM +0000, David Howells wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
>
> > On Thu, Nov 20, 2025 at 01:55:18PM +0000, David Howells wrote:
> > > Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > > + /* Compute d = (c mod 2^32) * (q^-1 mod 2^32). */
> > > > + s32 d = (s32)c * QINV_MOD_R;
> > >
> > > Hmmm... is "(s32)c" actually "(c mod 2^32)"? Should that be:
> > >
> > > u32 d = (u32)c * QINV_MOD_R;
> > >
> > > This is followed up by casting 'd' to "s64". I don't think that should
> > > sign-extend it, but...
> >
> > It selects the representative in the range [INT32_MIN, INT32_MAX],
> > rather than the representative in the range [0, UINT32_MAX]. The sign
> > extension is intentional.
>
> I'm concerned about the basis on which it becomes positive or negative. It
> looks like the sign bit ends up being chosen arbitrarily.
Right, it's unrelated to the sign of the s64 value, unless the s64 value
happens to fit in a s32. And that's okay: the worst-case analysis,
which considers the largest possible absolute value that can be built
up, assumes the signs happen to match repeatedly.
By the way, lest you think I'd doing anything particularly novel here,
the Dilithium reference code also uses this same (very-nearly-symmetric)
Montgomery reduction formula including the sign extension, where it made
its way into leancrypto and your patchset too. It also appears in FIPS
204 as Algorithm 49, MontgomeryReduce. There are other ways to
implement this stuff, but they would be less efficient.
However, unfortunately neither source explains it properly, and they
actually provide incorrect information. The comment in the reference
code says the input can be in "-2^{31}Q <= a <= Q*2^31", which isn't
quite correct; the upper bound is actually exclusive. In my code, I
correctly document the upper bound as being exclusive.
FIPS 204 documents the same incorrect interval, but then sort of gets
around it by only claiming that the output is less than 2q in absolute
value (rather than q) and also by not clarifying whether sign extension
is done. They may have thought that sign extension shouldn't be done,
as you seem to have thought. Either way, their explanation is
misleading. The very-nearly-symmetric version that produces an output
less than q in absolute value is the logical version when working with
signed values, and it seems to be what the Dilithium authors intended.
Anyway, it's clear that my code comments still didn't explain it
properly either, so I'll work on that.
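To spell that out (illustration only, not code from the patch): the two
variants below pick different 32-bit representatives of c * q^-1 mod 2^32.
Whenever those representatives differ, the final results differ by exactly
q; both are congruent to c * 2^-32 mod q, but only the signed choice stays
in (-q, q), while the unsigned one is only bounded by 2q in absolute
value, matching the weaker claim in FIPS 204.

	#define Q		8380417
	#define QINV_MOD_R	58728449

	/* Signed representative, as in the patch: result in (-q, q). */
	static s32 reduce_signed(s64 c)
	{
		/* Same value as (s32)c * QINV_MOD_R; the unsigned multiply
		 * just keeps the 2^32 wraparound well-defined here. */
		s32 d = (s32)((u32)c * QINV_MOD_R);

		return (c - (s64)d * Q) >> 32;
	}

	/* Unsigned representative: result only bounded by |result| < 2q. */
	static s32 reduce_unsigned(s64 c)
	{
		u32 d = (u32)c * QINV_MOD_R;

		return (c - (s64)d * Q) >> 32;
	}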
> > > > + /* Reduce to [0, q), then tmp = w'_1 = UseHint(h, w'_Approx) */
> > >
> > > Bracket mismatch. "[0, q]"
> >
> > It's intentional, since it denotes a mathematical range. Elsewhere I
> > used the words "the range" explicitly, so I'll add that above too. (Or
> > maybe reword it differently.)
>
> I meant you have an opening square bracket and a closing round bracket in
> "[0, q)".
That means the lower end is inclusive and the upper end is exclusive.
We're taking mod q, so we do *not* want q to be included.
I could write it another way that wouldn't assume familiarity with open
interval notation, like [0, q - 1] or 0 <= val < q.
- Eric
On Fri, Nov 21, 2025 at 09:14:21AM -0800, Eric Biggers wrote:
> However, unfortunately neither source explains it properly, and they
> actually provide incorrect information. The comment in the reference
> code says the input can be in "-2^{31}Q <= a <= Q*2^31", which isn't
> quite correct; the upper bound is actually exclusive. In my code, I
> correctly document the upper bound as being exclusive.
I opened https://github.com/pq-crystals/dilithium/issues/108 against the
reference implementation. So hopefully that comment will get fixed.
> FIPS 204 documents the same incorrect interval, but then sort of gets
> around it by only claiming that the output is less than 2q in absolute
> value (rather than q) and also by not clarifying whether sign extension
> is done. They may have thought that sign extension shouldn't be done,
> as you seem to have thought. Either way, their explanation is
> misleading. The very-nearly-symmetric version that produces an output
> less than q in absolute value is the logical version when working with
> signed values, and it seems to be what the Dilithium authors intended.
I'm collecting the mistakes that I've found in FIPS 204 into a list,
which I'll send in to NIST as an errata request at some point...
- Eric
Eric Biggers <ebiggers@kernel.org> wrote:

> I could write it another way that wouldn't assume familiarity with open
> interval notation, like [0, q - 1] or 0 <= val < q.

"[0, q-1]" would be less prone to confusion, thanks - and editors flagging
the bracket mismatch.

David
Eric Biggers <ebiggers@kernel.org> wrote:
> - Is about 600 lines of source code instead of 4800.
There's less shareable code for other algos that I'm sure people are going to
ask for, but that's probably fine.
> - Generates about 4 KB of object code instead of 28 KB.
> - Uses 9-13 KB of memory to verify a signature instead of 31-84 KB.
That's definitely good.
> - Is 3-5% faster, depending on the ML-DSA parameter set.
That's not quite what I see. For Leancrypto:
# benchmark_mldsa44: 8672 ops/s
# benchmark_mldsa65: 5470 ops/s
# benchmark_mldsa87: 3350 ops/s
For your implementation:
# benchmark_mldsa44: 8707 ops/s
# benchmark_mldsa65: 5423 ops/s
# benchmark_mldsa87: 3352 ops/s
This may reflect differences in CPU (mine's an i3-4170).
The numbers are pretty stable with the cpu frequency governor set to
performance and without rebooting betweentimes.
Interesting that your mldsa44 is consistently faster, but your mldsa65 is
consistently slower. mldsa87 is consistently about the same.
I don't think the time differences are particularly significant.
David
On Thu, Nov 20, 2025 at 09:10:00AM +0000, David Howells wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
>
> > - Is about 600 lines of source code instead of 4800.
>
> There's less shareable code for other algos that I'm sure people are going to
> ask for, but that's probably fine.

The "advanced" verification features that people could conceivably want
in the future (public key preloading, nonempty contexts, HashML-DSA,
external mu, incremental message hashing) would all be fairly
straightforward to add, in the event that they ever become needed.

Signing support would of course be challenging. But that's expected, and
we should try to keep that out of the kernel anyway.

> > - Generates about 4 KB of object code instead of 28 KB.
> > - Uses 9-13 KB of memory to verify a signature instead of 31-84 KB.
>
> That's definitely good.
>
> > - Is 3-5% faster, depending on the ML-DSA parameter set.
>
> That's not quite what I see. For Leancrypto:
>
> # benchmark_mldsa44: 8672 ops/s
> # benchmark_mldsa65: 5470 ops/s
> # benchmark_mldsa87: 3350 ops/s
>
> For your implementation:
>
> # benchmark_mldsa44: 8707 ops/s
> # benchmark_mldsa65: 5423 ops/s
> # benchmark_mldsa87: 3352 ops/s
>
> This may reflect differences in CPU (mine's an i3-4170).
>
> The numbers are pretty stable with the cpu frequency governor set to
> performance and without rebooting betweentimes.
>
> Interesting that your mldsa44 is consistently faster, but your mldsa65 is
> consistently slower. mldsa87 is consistently about the same.
>
> I don't think the time differences are particularly significant.

Sure, I had just tested one CPU. Slightly different results on different
CPUs are expected.

It's also expected that the ops/s for verification in a loop is still in
roughly the same ballpark as your integration of leancrypto (or the
Dilithium reference code which leancrypto seems to be based on, for that
matter). There aren't too many ways to implement the most time-consuming
parts. Generally, arch-optimized code would be needed to do significantly
better.

Of course, the greatly reduced icache and dcache usage is much more
important for performance. But that doesn't show up in the "just verify
the same signature in a loop repeatedly" benchmark. I'll clarify that
part of the commit message accordingly.

- Eric
You need to add something like:
#include <linux/module.h>
...
MODULE_DESCRIPTION("ML-DSA signature verification");
MODULE_LICENSE("GPL");
David
On Thu, Nov 20, 2025 at 08:14:56AM +0000, David Howells wrote:
> You need to add something like:
>
> #include <linux/module.h>
> ...
> MODULE_DESCRIPTION("ML-DSA signature verification");
> MODULE_LICENSE("GPL");
>
> David
Yep. Sorry, I usually build my test kernels with CONFIG_MODULES=n
because it's easier.
- Eric