[PATCH v3 0/2] arm64: Speed up CRC-32 using PMULL instructions

Ard Biesheuvel posted 2 patches 1 month, 1 week ago
There is a newer version of this series
arch/arm64/lib/Makefile     |   2 +-
arch/arm64/lib/crc32-4way.S | 242 ++++++++++++++++++++
arch/arm64/lib/crc32-glue.c |  82 +++++++
arch/arm64/lib/crc32.S      |  22 +-
4 files changed, 331 insertions(+), 17 deletions(-)
create mode 100644 arch/arm64/lib/crc32-4way.S
create mode 100644 arch/arm64/lib/crc32-glue.c
[PATCH v3 0/2] arm64: Speed up CRC-32 using PMULL instructions
Posted by Ard Biesheuvel 1 month, 1 week ago
From: Ard Biesheuvel <ardb@kernel.org>

The CRC-32 code is library code, and is not part of the crypto
subsystem. This means that callers may not generally be aware of the
kind of implementation that backs it, and so we've refrained from using
FP/SIMD code in the past, as it disables preemption, and this may incur
scheduling latencies that the caller did not anticipate.

This was solved a while ago, and on arm64, kernel mode FP/SIMD no longer
disables preemption.

This means we can happily use PMULL instructions in the CRC-32 library
code, which permits an optimization to be implemented that results in a
speedup of 2 - 2.8x for inputs >1k in size (on Apple M2)

Patch #1 implements some prepwork to handle the scalar CRC-32
alternatives patching in C code.

Changes since v2:
- drop alternatives.h #include (#1)
- drop unneeded branch (#2)
- fix comment max -> min (#2)
- add Eric's Rb

Changes since v1:
- rename crc32-pmull.S to crc32-4way.S and avoid pmull in the function
  names to avoid confusion about the nature of the implementation;
- polish the asm a bit, and add some comments
- don't return via the scalar code if len dropped to 0 after calling the
  4-way code.

Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Kees Cook <kees@kernel.org>

Ard Biesheuvel (2):
  arm64/lib: Handle CRC-32 alternative in C code
  arm64/crc32: Implement 4-way interleave using PMULL

 arch/arm64/lib/Makefile     |   2 +-
 arch/arm64/lib/crc32-4way.S | 242 ++++++++++++++++++++
 arch/arm64/lib/crc32-glue.c |  82 +++++++
 arch/arm64/lib/crc32.S      |  22 +-
 4 files changed, 331 insertions(+), 17 deletions(-)
 create mode 100644 arch/arm64/lib/crc32-4way.S
 create mode 100644 arch/arm64/lib/crc32-glue.c

-- 
2.47.0.rc1.288.g06298d1525-goog
Re: [PATCH v3 0/2] arm64: Speed up CRC-32 using PMULL instructions
Posted by Ard Biesheuvel 1 month, 1 week ago
On Thu, 17 Oct 2024 at 11:41, Ard Biesheuvel <ardb+git@google.com> wrote:
>
> From: Ard Biesheuvel <ardb@kernel.org>
>
> The CRC-32 code is library code, and is not part of the crypto
> subsystem. This means that callers may not generally be aware of the
> kind of implementation that backs it, and so we've refrained from using
> FP/SIMD code in the past, as it disables preemption, and this may incur
> scheduling latencies that the caller did not anticipate.
>
> This was solved a while ago, and on arm64, kernel mode FP/SIMD no longer
> disables preemption.
>
> This means we can happily use PMULL instructions in the CRC-32 library
> code, which permits an optimization to be implemented that results in a
> speedup of 2 - 2.8x for inputs >1k in size (on Apple M2)
>
> Patch #1 implements some prepwork to handle the scalar CRC-32
> alternatives patching in C code.
>
> Changes since v2:
> - drop alternatives.h #include (#1)
> - drop unneeded branch (#2)
> - fix comment max -> min (#2)
> - add Eric's Rb
>
> Changes since v1:
> - rename crc32-pmull.S to crc32-4way.S and avoid pmull in the function
>   names to avoid confusion about the nature of the implementation;
> - polish the asm a bit, and add some comments
> - don't return via the scalar code if len dropped to 0 after calling the
>   4-way code.
>
> Cc: Eric Biggers <ebiggers@kernel.org>
> Cc: Kees Cook <kees@kernel.org>
>
> Ard Biesheuvel (2):
>   arm64/lib: Handle CRC-32 alternative in C code
>   arm64/crc32: Implement 4-way interleave using PMULL
>

I'll need to respin this - the crc32_be code doesn't actually work correctly.
Re: [PATCH v3 0/2] arm64: Speed up CRC-32 using PMULL instructions
Posted by Eric Biggers 1 month, 1 week ago
On Thu, Oct 17, 2024 at 06:30:19PM +0200, Ard Biesheuvel wrote:
> > Ard Biesheuvel (2):
> >   arm64/lib: Handle CRC-32 alternative in C code
> >   arm64/crc32: Implement 4-way interleave using PMULL
> >
> 
> I'll need to respin this - the crc32_be code doesn't actually work correctly.

Right, good catch.  It looks like it needs an rbit of the crc value at the
beginning and end.  lib/crc32test.c doesn't actually test crc32_be_arm64_4way()
because it runs the tests with IRQs disabled; it probably shouldn't do that.

On a slightly related topic, since any crc32_le() and __crc32c_le() functions in
arch/*/lib/ are automatically exposed as shash algorithms via the crypto API
(this was already the case, but your other patch makes this more explicit by
properly separating them from the generic implementation), I wonder if all the
remaining arch/*/crypto/crc32*.c should be migrated to arch/*/lib/, and then
users of crc32 and crc32c like ext4 and f2fs should just use the library
functions instead of shash.  That would simply things greatly.  See e.g. the
horrible hacks used in ext4_chksum() and __f2fs_crc32()...

The only crc32 and crc32c implementations that *aren't* software based are those
in drivers/crypto/stm32/stm32-crc32.c and
drivers/crypto/inside-secure/safexcel_hash.c.  Access to those would be lost by
going through lib.  But I strongly suspect they exist just because the hardware
supported it and not because they are actually useful.

- Eric
Re: [PATCH v3 0/2] arm64: Speed up CRC-32 using PMULL instructions
Posted by Ard Biesheuvel 1 month, 1 week ago
On Fri, 18 Oct 2024 at 00:04, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Thu, Oct 17, 2024 at 06:30:19PM +0200, Ard Biesheuvel wrote:
> > > Ard Biesheuvel (2):
> > >   arm64/lib: Handle CRC-32 alternative in C code
> > >   arm64/crc32: Implement 4-way interleave using PMULL
> > >
> >
> > I'll need to respin this - the crc32_be code doesn't actually work correctly.
>
> Right, good catch.  It looks like it needs an rbit of the crc value at the
> beginning and end.  lib/crc32test.c doesn't actually test crc32_be_arm64_4way()
> because it runs the tests with IRQs disabled; it probably shouldn't do that.
>

Yeah, we should probably fix that.

> On a slightly related topic, since any crc32_le() and __crc32c_le() functions in
> arch/*/lib/ are automatically exposed as shash algorithms via the crypto API
> (this was already the case, but your other patch makes this more explicit by
> properly separating them from the generic implementation), I wonder if all the
> remaining arch/*/crypto/crc32*.c should be migrated to arch/*/lib/, and then
> users of crc32 and crc32c like ext4 and f2fs should just use the library
> functions instead of shash.  That would simply things greatly.  See e.g. the
> horrible hacks used in ext4_chksum() and __f2fs_crc32()...
>
> The only crc32 and crc32c implementations that *aren't* software based are those
> in drivers/crypto/stm32/stm32-crc32.c and
> drivers/crypto/inside-secure/safexcel_hash.c.  Access to those would be lost by
> going through lib.  But I strongly suspect they exist just because the hardware
> supported it and not because they are actually useful.
>

Indeed. Another case where the flexibility of the shash interface
doesn't buy us anything but overhead and complexity.