[PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Posted by Herbert Xu 3 months, 4 weeks ago
On Thu, Jun 12, 2025 at 10:54:39PM -0700, Eric Biggers wrote:
>
> Actually, crypto_ahash::base::fb is initialized if CRYPTO_ALG_NEED_FALLBACK,
> which many of the drivers already set.  Then crypto_ahash_update() calls
> ahash_do_req_chain() if the algorithm does *not* have
> CRYPTO_AHASH_ALG_BLOCK_ONLY set.  Which then exports the driver's custom state
> and tries to import it into the fallback.
> 
> As far as I can tell, it's just broken for most of the existing drivers.

This fallback path is only meant to be used for drivers that have
been converted.  But you're right there is a check missing in there.

Thanks,

---8<---
Ensure that drivers that have not been converted to the ahash API
do not use the ahash_request_set_virt fallback path as they cannot
use the software fallback.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Fixes: 9d7a0ab1c753 ("crypto: ahash - Handle partial blocks in API")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/crypto/ahash.c b/crypto/ahash.c
index e10bc2659ae4..992228a9f283 100644
--- a/crypto/ahash.c
+++ b/crypto/ahash.c
@@ -347,6 +347,9 @@ static int ahash_do_req_chain(struct ahash_request *req,
 	if (crypto_ahash_statesize(tfm) > HASH_MAX_STATESIZE)
 		return -ENOSYS;
 
+	if (crypto_hash_no_export_core(tfm))
+		return -ENOSYS;
+
 	{
 		u8 state[HASH_MAX_STATESIZE];
 
diff --git a/include/crypto/internal/hash.h b/include/crypto/internal/hash.h
index 0f85c543f80b..f052afa6e7b0 100644
--- a/include/crypto/internal/hash.h
+++ b/include/crypto/internal/hash.h
@@ -91,6 +91,12 @@ static inline bool crypto_hash_alg_needs_key(struct hash_alg_common *alg)
 		!(alg->base.cra_flags & CRYPTO_ALG_OPTIONAL_KEY);
 }
 
+static inline bool crypto_hash_no_export_core(struct crypto_ahash *tfm)
+{
+	return crypto_hash_alg_common(tfm)->base.cra_flags &
+	       CRYPTO_AHASH_ALG_NO_EXPORT_CORE;
+}
+
 int crypto_grab_ahash(struct crypto_ahash_spawn *spawn,
 		      struct crypto_instance *inst,
 		      const char *name, u32 type, u32 mask);
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Posted by Eric Biggers 3 months, 3 weeks ago
On Fri, Jun 13, 2025 at 04:51:38PM +0800, Herbert Xu wrote:
> On Thu, Jun 12, 2025 at 10:54:39PM -0700, Eric Biggers wrote:
> >
> > Actually, crypto_ahash::base::fb is initialized if CRYPTO_ALG_NEED_FALLBACK,
> > which many of the drivers already set.  Then crypto_ahash_update() calls
> > ahash_do_req_chain() if the algorithm does *not* have
> > CRYPTO_AHASH_ALG_BLOCK_ONLY set.  Which then exports the driver's custom state
> > and tries to import it into the fallback.
> > 
> > As far as I can tell, it's just broken for most of the existing drivers.
> 
> This fallback path is only meant to be used for drivers that have
> been converted.  But you're right there is a check missing in there.
> 
> Thanks,
> 
> ---8<---
> Ensure that drivers that have not been converted to the ahash API
> do not use the ahash_request_set_virt fallback path as they cannot
> use the software fallback.
> 
> Reported-by: Eric Biggers <ebiggers@kernel.org>
> Fixes: 9d7a0ab1c753 ("crypto: ahash - Handle partial blocks in API")
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Okay.  Out of curiosity I decided to actually test the Qualcomm Crypto Engine
driver on a development board that has a Qualcomm SoC, using latest mainline.

Even with your patch applied, it overflows the stack when running the crypto
self-tests, apparently due to crypto/ahash.c calling into itself recursively:

    [    9.230887] Insufficient stack space to handle exception!
    [    9.230889] ESR: 0x0000000096000047 -- DABT (current EL)
    [    9.230891] FAR: 0xffff800084927fe0
    [    9.230891] Task stack:     [0xffff800084928000..0xffff80008492c000]
    [    9.230893] IRQ stack:      [0xffff800080030000..0xffff800080034000]
    [    9.230894] Overflow stack: [0xffff000a72dd2100..0xffff000a72dd3100]
    [    9.230896] CPU: 6 UID: 0 PID: 747 Comm: cryptomgr_test Tainted: G S                  6.16.0-rc1-00237-g84ffcd88616f #7 PREEMPT 
    [    9.230900] Tainted: [S]=CPU_OUT_OF_SPEC
    [    9.230901] Hardware name: Qualcomm Technologies, Inc. SM8650 HDK (DT)
    [    9.230901] pstate: 01400005 (nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    [    9.230903] pc : qce_ahash_update+0x4/0x1f4
    [    9.230910] lr : ahash_do_req_chain+0xb4/0x19c
    [    9.230915] sp : ffff800084928030
    [    9.230915] x29: ffff8000849281a0 x28: 0000000000000003 x27: 0000000000000001
    [    9.230918] x26: ffff0008022d8060 x25: ffff000800a33500 x24: ffff80008492b8d8
    [    9.230920] x23: ffff80008492b918 x22: 0000000000000400 x21: ffff000800a33510
    [    9.230922] x20: ffff000800b62030 x19: ffff00080122d400 x18: 00000000ffffffff
    [    9.230923] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [    9.230925] x14: 0000000000000001 x13: 0000000000000000 x12: 0000000000000000
    [    9.230927] x11: eee1c132902c61e2 x10: 0000000000000063 x9 : 0000000000000000
    [    9.230928] x8 : 0000000000000062 x7 : a54ff53a3c6ef372 x6 : 0000000000000400
    [    9.230930] x5 : fefefefefefefefe x4 : ffff000800a33510 x3 : 0000000000000000
    [    9.230931] x2 : ffff000805d76900 x1 : ffffcea2349738cc x0 : ffff00080122d400
    [    9.230933] Kernel panic - not syncing: kernel stack overflow
    [    9.230934] CPU: 6 UID: 0 PID: 747 Comm: cryptomgr_test Tainted: G S                  6.16.0-rc1-00237-g84ffcd88616f #7 PREEMPT 
    [    9.230936] Tainted: [S]=CPU_OUT_OF_SPEC
    [    9.230937] Hardware name: Qualcomm Technologies, Inc. SM8650 HDK (DT)
    [    9.230938] Call trace:
    [    9.230939]  show_stack+0x18/0x24 (C)
    [    9.230943]  dump_stack_lvl+0x60/0x80
    [    9.230947]  dump_stack+0x18/0x24
    [    9.230949]  panic+0x168/0x360
    [    9.230952]  add_taint+0x0/0xbc
    [    9.230955]  panic_bad_stack+0x108/0x120
    [    9.230958]  handle_bad_stack+0x34/0x40
    [    9.230962]  __bad_stack+0x80/0x84
    [    9.230963]  qce_ahash_update+0x4/0x1f4 (P)
    [    9.230965]  crypto_ahash_update+0x17c/0x18c
    [    9.230967]  crypto_ahash_finup+0x184/0x1e4
    [    9.230969]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230970]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230972]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230973]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230974]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230976]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230977]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230979]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230980]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230981]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230983]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230984]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230986]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230988]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230989]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230991]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230993]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230995]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230996]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230998]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.230999]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231001]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231002]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231004]  crypto_ahash_finup+0x1ac/0x1e4
    [    9.231005]  crypto_ahash_finup+0x1ac/0x1e4

    [the above line repeated a few hundred times more...]

    [    9.231571]  test_ahash_vec_cfg+0x508/0x8f8
    [    9.231573]  test_hash_vec+0xb8/0x21c
    [    9.231575]  __alg_test_hash+0x144/0x2e0
    [    9.231577]  alg_test_hash+0xc0/0x178
    [    9.231578]  alg_test+0x148/0x5ec
    [    9.231579]  cryptomgr_test+0x24/0x40
    [    9.231581]  kthread+0x12c/0x204
    [    9.231583]  ret_from_fork+0x10/0x20
    [    9.231587] SMP: stopping secondary CPUs
    [    9.240072] Kernel Offset: 0x4ea1b2a80000 from 0xffff800080000000
    [    9.240073] PHYS_OFFSET: 0xfff1000080000000
    [    9.240074] CPU features: 0x6000,000001c0,62130cb1,357e7667
    [    9.240075] Memory Limit: none
    [   11.373410] ---[ end Kernel panic - not syncing: kernel stack overflow ]---

After disabling the crypto self-tests, I was then able to run a benchmark of
SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
recursion bug.  I got the following results:

    ARMv8 crypto extensions: 1864 MB/s
    Generic C code: 358 MB/s
    Qualcomm Crypto Engine: 55 MB/s
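
For reference, a minimal way to drive an ahash like this from kernel code is
the standard one-shot digest flow.  The snippet below is only an illustrative
sketch, not my actual benchmark code (the helper name, "sha256", and the error
handling are made up; note that the source buffer must not be on the stack,
since hardware drivers DMA from the scatterlist):

#include <crypto/hash.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int sha256_one_buf(const void *buf, unsigned int len, u8 *digest)
{
	struct crypto_ahash *tfm;
	struct ahash_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	tfm = crypto_alloc_ahash("sha256", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	req = ahash_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		crypto_free_ahash(tfm);
		return -ENOMEM;
	}

	/* One linear source buffer, one request, wait synchronously. */
	sg_init_one(&sg, buf, len);
	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
					CRYPTO_TFM_REQ_MAY_SLEEP,
				   crypto_req_done, &wait);
	ahash_request_set_crypt(req, &sg, digest, len);

	err = crypto_wait_req(crypto_ahash_digest(req), &wait);

	ahash_request_free(req);
	crypto_free_ahash(tfm);
	return err;
}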

So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
Crypto Engine one are useful, and the changes that you're requiring to the
CPU-based code are to support these drivers?

- Eric
Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Posted by Ard Biesheuvel 3 months, 3 weeks ago
On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
>
...
> After disabling the crypto self-tests, I was then able to run a benchmark of
> SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> recursion bug.  I got the following results:
>
>     ARMv8 crypto extensions: 1864 MB/s
>     Generic C code: 358 MB/s
>     Qualcomm Crypto Engine: 55 MB/s
>
> So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> Crypto Engine one are useful, and the changes that you're requiring to the
> CPU-based code are to support these drivers?
>

And this offload engine only has one internal queue, right? Whereas
the CPU results may be multiplied by the number of cores on the SoC.
It would still be interesting to know how much of this is due to
latency rather than limited throughput, but it seems highly unlikely
that there are any message sizes large enough that QCE would catch
up with the CPUs. (AIUI, the only use case we have in the kernel
today for message sizes that are substantially larger than this is
kTLS, but I'm not sure how well it works with crypto_aead compared
to offload at a more suitable level in the networking stack, and
this driver does not implement GCM in the first place.)

On ARM SoCs, these offload engines usually exist primarily for the
benefit of the verified boot implementation in mask ROM, which
obviously needs to be minimal but doesn't have to be very fast in
order to get past the first boot stages and hand over to software.
Then, since the IP block is there, it's listed as a feature in the
data sheet, even though it is not very useful when running under the
OS.
Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Posted by Eric Biggers 3 months, 3 weeks ago
On Sun, Jun 15, 2025 at 09:22:51AM +0200, Ard Biesheuvel wrote:
> On Sun, 15 Jun 2025 at 05:18, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> ...
> > After disabling the crypto self-tests, I was then able to run a benchmark of
> > SHA-256 hashing 4096-byte messages, which fortunately didn't encounter the
> > recursion bug.  I got the following results:
> >
> >     ARMv8 crypto extensions: 1864 MB/s
> >     Generic C code: 358 MB/s
> >     Qualcomm Crypto Engine: 55 MB/s
> >
> > So just to clarify, you believe that asynchronous hash drivers like the Qualcomm
> > Crypto Engine one are useful, and the changes that you're requiring to the
> > CPU-based code are to support these drivers?
> >
> 
> And this offload engine only has one internal queue, right? Whereas
> the CPU results may be multiplied by the number of cores on the SoC.
> It would still be interesting to know how much of this is due to
> latency rather than limited throughput, but it seems highly unlikely
> that there are any message sizes large enough that QCE would catch
> up with the CPUs. (AIUI, the only use case we have in the kernel
> today for message sizes that are substantially larger than this is
> kTLS, but I'm not sure how well it works with crypto_aead compared
> to offload at a more suitable level in the networking stack, and
> this driver does not implement GCM in the first place.)
> 
> On ARM SoCs, these offload engines usually exist primarily for the
> benefit of the verified boot implementation in mask ROM, which
> obviously needs to be minimal but doesn't have to be very fast in
> order to get past the first boot stages and hand over to software.
> Then, since the IP block is there, it's listed as a feature in the
> data sheet, even though it is not very useful when running under the
> OS.

With 1 MiB messages, I get 1913 MB/s with ARMv8 CE and 142 MB/s with QCE.

(BTW, that's single-buffer ARMv8 CE.  My two-buffer code is over 3000 MB/s.)

I then changed my benchmark code to take full advantage of the async API and
submit as many requests as the hardware can handle.  (This would be a best-case
scenario for QCE; in many real use cases this is not possible.)  Result with QCE
was 58 MB/s with 4 KiB messages or 155 MB/s for 1 MiB messages.
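
Taking advantage of the async API basically means keeping the queue full:
submit digest requests back to back and retire them from the completion
callback instead of waiting on each one.  Again, the sketch below is only
illustrative, not my actual benchmark code (the names are made up, 32 is
SHA-256's digest size, and the error handling is simplified):

#include <crypto/hash.h>
#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static atomic_t outstanding;
static DECLARE_COMPLETION(all_done);

static void bench_req_done(void *data, int err)
{
	struct ahash_request *req = data;

	/* With MAY_BACKLOG, -EINPROGRESS just means "left the backlog". */
	if (err == -EINPROGRESS)
		return;

	ahash_request_free(req);
	if (atomic_dec_and_test(&outstanding))
		complete(&all_done);
}

static int bench_async(struct crypto_ahash *tfm, struct scatterlist *sg,
		       unsigned int len, u8 *digests, int nreq)
{
	int i, err;

	atomic_set(&outstanding, nreq);
	for (i = 0; i < nreq; i++) {
		struct ahash_request *req;

		req = ahash_request_alloc(tfm, GFP_KERNEL);
		if (!req)
			return -ENOMEM;
		ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
					   bench_req_done, req);
		/* Same read-only source data, one digest slot per request. */
		ahash_request_set_crypt(req, sg, digests + i * 32, len);

		err = crypto_ahash_digest(req);
		if (err != -EINPROGRESS && err != -EBUSY)
			bench_req_done(req, err); /* finished synchronously */
	}
	wait_for_completion(&all_done);
	return 0;
}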

So yes, QCE seems to have only one queue, and even that one queue is *much*
slower than just using the CPU.  It's even slower than the generic C code.

And until I fixed it recently, the Crypto API defaulted to using QCE instead of
ARMv8 CE.

But this seems to be a common pattern among the offload engines.
I noticed a similar issue with Intel QAT, which I elaborate on in this patch:
https://lore.kernel.org/r/20250615045145.224567-1-ebiggers@kernel.org

- Eric
Re: [PATCH] crypto: ahash - Stop legacy tfms from using the set_virt fallback path
Posted by Linus Torvalds 3 months, 3 weeks ago
On Sun, 15 Jun 2025 at 11:47, Eric Biggers <ebiggers@kernel.org> wrote:
>
> So yes, QCE seems to have only one queue, and even that one queue is *much*
> slower than just using the CPU.  It's even slower than the generic C code.

Honestly, I have *NEVER* seen an external crypto accelerator that is
worth using unless it's integrated with the target IO.

Now, it's not my area of expertise either, so there may well be some
random case that I haven't heard about, but the only sensible use case
I'm aware of is when the network card does all the offloading and just
does the whole SSL thing (or IPsec or whatever, but if you care about
performance you'd be better off using WireGuard and doing it all on
the CPU anyway).

And even then, people tend to not be happy with the results, because
the hardware is too inflexible or too rare.

(Replace "network card" with "disk controller" if that's your thing -
the basic idea is the same: it's worthwhile if it's done natively by
the IO target, not done by some third party accelerator - and while
I'm convinced encryption on the disk controller makes sense, I'm not
sure I'd actually *trust* it from a real cryptographic standpoint if
you really care about it, because some of those are most definitely
black boxes with the trust model seemingly being based on the "Trust
me, Bro" approach to security).

The other case is the "key is physically separate and isn't even under
kernel control at all" one, but then it's never about performance in
the first place (i.e. security keys etc).

Even if the hardware crypto engines were fast - and as you see, they
aren't - any possible performance is absolutely killed by lack of
caches and the IO overhead.

This seems to be pretty much true of async SMP crypto on the CPU
as well.  You can get better benchmarks by offloading the crypto to
other CPU's, but I'm not convinced it's actually a good trade-off in
reality. The cost of scheduling and just all the overhead of
synchronization is very very real, and the benchmarks where it looks
good tend to be the "we do nothing else, and we don't actually touch
the data anyway, it's just purely about pointless benchmarking".

Just the set-up costs for doing things asynchronously can be higher
than the cost of just doing the operation itself.

             Linus
[PATCH] crypto: ahash - Fix infinite recursion in ahash_def_finup
Posted by Herbert Xu 3 months, 3 weeks ago
On Sat, Jun 14, 2025 at 08:18:07PM -0700, Eric Biggers wrote:
>
> Even with your patch applied, it overflows the stack when running the crypto
> self-tests, apparently due to crypto/ahash.c calling into itself recursively:

Thanks for the report.  This driver doesn't provide a finup function,
which exposed a bug in the default finup implementation:

---8<---
Invoke the final function directly in the default finup implementation
since crypto_ahash_final is now just a wrapper around finup.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Fixes: 9d7a0ab1c753 ("crypto: ahash - Handle partial blocks in API")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/crypto/ahash.c b/crypto/ahash.c
index bd9e49950201..3878b4da3cfd 100644
--- a/crypto/ahash.c
+++ b/crypto/ahash.c
@@ -603,12 +603,14 @@ static void ahash_def_finup_done2(void *data, int err)
 
 static int ahash_def_finup_finish1(struct ahash_request *req, int err)
 {
+	struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
+
 	if (err)
 		goto out;
 
 	req->base.complete = ahash_def_finup_done2;
 
-	err = crypto_ahash_final(req);
+	err = crypto_ahash_alg(tfm)->final(req);
 	if (err == -EINPROGRESS || err == -EBUSY)
 		return err;
 
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt