[PATCH v2 00/18] crypto: Provide clmul.h and host accel

Richard Henderson posted 18 patches 9 months ago
Only 16 patches received!
There is a newer version of this series
[PATCH v2 00/18] crypto: Provide clmul.h and host accel
Posted by Richard Henderson 9 months ago
Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
carry-less multiply under emulation.

Changes for v2:
  * Only accelerate clmul_64; keep generic helpers for other sizes.
  * Drop most of the Int128 interfaces, except for clmul_64.
  * Use the same acceleration format as aes-round.h.
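
For readers unfamiliar with the operation, a carry-less multiply treats both operands as polynomials over GF(2), so partial products are combined with XOR rather than addition. The following is a minimal bit-by-bit sketch of a 64x64 -> 128 bit version, not the code from this series: the names clmul128_sketch and clmul64_sketch are hypothetical, and the series itself returns an Int128 rather than a struct.

```c
#include <stdint.h>

/* Hypothetical result type for illustration; not QEMU's Int128. */
typedef struct {
    uint64_t lo;   /* bits 0..63 of the product */
    uint64_t hi;   /* bits 64..127 of the product */
} clmul128_sketch;

/*
 * Carry-less multiply, one bit of b at a time: for every set bit i of b,
 * XOR (a << i) into the 128-bit accumulator.
 */
static clmul128_sketch clmul64_sketch(uint64_t a, uint64_t b)
{
    clmul128_sketch r = { 0, 0 };

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0) {
                /* Bits of a that were shifted out of the low half. */
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}
```

As a sanity check, clmul64_sketch(3, 3) yields lo = 5 and hi = 0, because (x + 1)^2 = x^2 + 1 over GF(2), whereas an ordinary multiply would give 9.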


r~


[1] https://patchew.org/QEMU/20230601123332.3297404-1-ardb@kernel.org/

Richard Henderson (18):
  crypto: Add generic 8-bit carry-less multiply routines
  target/arm: Use clmul_8* routines
  target/s390x: Use clmul_8* routines
  target/ppc: Use clmul_8* routines
  crypto: Add generic 16-bit carry-less multiply routines
  target/arm: Use clmul_16* routines
  target/s390x: Use clmul_16* routines
  target/ppc: Use clmul_16* routines
  crypto: Add generic 32-bit carry-less multiply routines
  target/arm: Use clmul_32* routines
  target/s390x: Use clmul_32* routines
  target/ppc: Use clmul_32* routines
  crypto: Add generic 64-bit carry-less multiply routine
  target/arm: Use clmul_64
  target/s390x: Use clmul_64
  target/ppc: Use clmul_64
  host/include/i386: Implement clmul.h
  host/include/aarch64: Implement clmul.h

 host/include/aarch64/host/cpuinfo.h      |   1 +
 host/include/aarch64/host/crypto/clmul.h |  41 +++++
 host/include/generic/host/crypto/clmul.h |  15 ++
 host/include/i386/host/cpuinfo.h         |   1 +
 host/include/i386/host/crypto/clmul.h    |  29 ++++
 host/include/x86_64/host/crypto/clmul.h  |   1 +
 include/crypto/clmul.h                   |  83 ++++++++++
 include/qemu/cpuid.h                     |   3 +
 target/arm/tcg/vec_internal.h            |  11 --
 crypto/clmul.c                           | 112 ++++++++++++++
 target/arm/tcg/mve_helper.c              |  16 +-
 target/arm/tcg/vec_helper.c              | 102 ++-----------
 target/ppc/int_helper.c                  |  64 ++++----
 target/s390x/tcg/vec_int_helper.c        | 186 ++++++++++-------------
 util/cpuinfo-aarch64.c                   |   4 +-
 util/cpuinfo-i386.c                      |   1 +
 crypto/meson.build                       |   9 +-
 17 files changed, 425 insertions(+), 254 deletions(-)
 create mode 100644 host/include/aarch64/host/crypto/clmul.h
 create mode 100644 host/include/generic/host/crypto/clmul.h
 create mode 100644 host/include/i386/host/crypto/clmul.h
 create mode 100644 host/include/x86_64/host/crypto/clmul.h
 create mode 100644 include/crypto/clmul.h
 create mode 100644 crypto/clmul.c

-- 
2.34.1
Re: [PATCH v2 00/18] crypto: Provide clmul.h and host accel
Posted by Ard Biesheuvel 8 months, 4 weeks ago
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-ardb@kernel.org/
>
> Richard Henderson (18):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>

I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
passes all its crypto selftests when running under TCG emulation on a
TX2 arm64 host, so

Tested-by: Ard Biesheuvel <ardb@kernel.org>

for the series.

Thanks,
Ard.
Re: [PATCH v2 00/18] crypto: Provide clmul.h and host accel
Posted by Richard Henderson 8 months, 4 weeks ago
On 8/21/23 07:57, Ard Biesheuvel wrote:
>> Richard Henderson (18):
>>    crypto: Add generic 8-bit carry-less multiply routines
>>    target/arm: Use clmul_8* routines
>>    target/s390x: Use clmul_8* routines
>>    target/ppc: Use clmul_8* routines
>>    crypto: Add generic 16-bit carry-less multiply routines
>>    target/arm: Use clmul_16* routines
>>    target/s390x: Use clmul_16* routines
>>    target/ppc: Use clmul_16* routines
>>    crypto: Add generic 32-bit carry-less multiply routines
>>    target/arm: Use clmul_32* routines
>>    target/s390x: Use clmul_32* routines
>>    target/ppc: Use clmul_32* routines
>>    crypto: Add generic 64-bit carry-less multiply routine
>>    target/arm: Use clmul_64
>>    target/s390x: Use clmul_64
>>    target/ppc: Use clmul_64
>>    host/include/i386: Implement clmul.h
>>    host/include/aarch64: Implement clmul.h
>>
> 
> I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
> passes all its crypto selftests when running under TCG emulation on a
> TX2 arm64 host, so
> 
> Tested-by: Ard Biesheuvel <ardb@kernel.org>

Oh, whoops.  What's missing here?  Any target/i386 changes.


r~
Re: [PATCH v2 00/18] crypto: Provide clmul.h and host accel
Posted by Ard Biesheuvel 8 months, 4 weeks ago
On Mon, 21 Aug 2023 at 17:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 8/21/23 07:57, Ard Biesheuvel wrote:
> >> Richard Henderson (18):
> >>    crypto: Add generic 8-bit carry-less multiply routines
> >>    target/arm: Use clmul_8* routines
> >>    target/s390x: Use clmul_8* routines
> >>    target/ppc: Use clmul_8* routines
> >>    crypto: Add generic 16-bit carry-less multiply routines
> >>    target/arm: Use clmul_16* routines
> >>    target/s390x: Use clmul_16* routines
> >>    target/ppc: Use clmul_16* routines
> >>    crypto: Add generic 32-bit carry-less multiply routines
> >>    target/arm: Use clmul_32* routines
> >>    target/s390x: Use clmul_32* routines
> >>    target/ppc: Use clmul_32* routines
> >>    crypto: Add generic 64-bit carry-less multiply routine
> >>    target/arm: Use clmul_64
> >>    target/s390x: Use clmul_64
> >>    target/ppc: Use clmul_64
> >>    host/include/i386: Implement clmul.h
> >>    host/include/aarch64: Implement clmul.h
> >>
> >
> > I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
> > passes all its crypto selftests when running under TCG emulation on a
> > TX2 arm64 host, so
> >
> > Tested-by: Ard Biesheuvel <ardb@kernel.org>
>
> Oh, whoops.  What's missing here?  Any target/i386 changes.
>

Ah yes - I hadn't spotted that. The below seems to do the trick.

--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2156,7 +2156,10 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
     for (i = 0; i < 1 << SHIFT; i += 2) {
         a = v->Q(((ctrl & 1) != 0) + i);
         b = s->Q(((ctrl & 16) != 0) + i);
-        clmulq(&d->Q(i), &d->Q(i + 1), a, b);
+
+        Int128 r = clmul_64(a, b);
+        d->Q(i) = int128_getlo(r);
+        d->Q(i + 1) = int128_gethi(r);
     }
 }

[and the #include added and clmulq() dropped]
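
To make the operand selection in that loop explicit, here is a standalone sketch of a single 128-bit lane: bit 0 of ctrl picks the low or high quadword of the first source, bit 4 picks the quadword of the second source, and the 128-bit product is split into the two destination quadwords. The names xmm_sketch, clmul64_ref and pclmulqdq_sketch are hypothetical and stand in for QEMU's Reg type, clmul_64() and the helper; this is an illustration under those assumptions, not the code that was tested.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit lane of QEMU's Reg type. */
typedef struct {
    uint64_t q[2];
} xmm_sketch;

/* Reference bit-by-bit carry-less multiply, standing in for clmul_64(). */
static void clmul64_ref(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l = 0, h = 0;

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i != 0) {
                h ^= a >> (64 - i);
            }
        }
    }
    *lo = l;
    *hi = h;
}

/* One lane of PCLMULQDQ: ctrl bit 0 selects from v, ctrl bit 4 from s. */
static xmm_sketch pclmulqdq_sketch(xmm_sketch v, xmm_sketch s, unsigned ctrl)
{
    uint64_t a = v.q[(ctrl & 1) != 0];
    uint64_t b = s.q[(ctrl & 16) != 0];
    xmm_sketch d;

    /* Low half goes to quadword 0, high half to quadword 1. */
    clmul64_ref(a, b, &d.q[0], &d.q[1]);
    return d;
}
```

With ctrl = 0x00 the two low quadwords are multiplied, and with ctrl = 0x11 the two high quadwords are multiplied.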

I did a quick RFC4106 benchmark with tcrypt (which doesn't speed up as
much as OpenSSL, but cross-rebuilding that is a bit of a hassle):

no acceleration:

tcrypt: test 7 (160 bit key, 8192 byte blocks): 1547 operations in 1 seconds (12673024 bytes)

AES only:

tcrypt: test 7 (160 bit key, 8192 byte blocks): 1679 operations in 1 seconds (13754368 bytes)

AES and PMULL:

tcrypt: test 7 (160 bit key, 8192 byte blocks): 3298 operations in 1 seconds (27017216 bytes)