[PATCH] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
For both gcc and clang, crc32 builtins generate better code than the
inline asm. GCC improves, removing unneeded "mov" instructions. Clang
does the same and unrolls the loops. GCC has no changes on i386, but
Clang's code generation is vastly improved, due to Clang's "rm"
constraint issue.

The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
is expected because of the "rm" issue. However, Clang's performance is
better than GCC's by ~1.5%, most likely due to loop unrolling.

Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: llvm@lists.linux.dev
Signed-off-by: Bill Wendling <morbo@google.com>
---
 arch/x86/Makefile         | 3 +++
 arch/x86/lib/crc32-glue.c | 8 ++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5b773b34768d..241436da1473 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -114,6 +114,9 @@ else
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
 endif

+# Enables the use of CRC32 builtins.
+KBUILD_CFLAGS += -mcrc32
+
 ifeq ($(CONFIG_X86_32),y)
         BITS := 32
         UTS_MACHINE := i386
diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
index 2dd18a886ded..fdb94bff25f4 100644
--- a/arch/x86/lib/crc32-glue.c
+++ b/arch/x86/lib/crc32-glue.c
@@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
 EXPORT_SYMBOL(crc32_le_arch);

 #ifdef CONFIG_X86_64
-#define CRC32_INST "crc32q %1, %q0"
+#define CRC32_INST __builtin_ia32_crc32di
 #else
-#define CRC32_INST "crc32l %1, %0"
+#define CRC32_INST __builtin_ia32_crc32si
 #endif

 /*
@@ -78,10 +78,10 @@ u32 crc32c_le_arch(u32 crc, const u8 *p, size_t len)

        for (num_longs = len / sizeof(unsigned long);
             num_longs != 0; num_longs--, p += sizeof(unsigned long))
-               asm(CRC32_INST : "+r" (crc) : "rm" (*(unsigned long *)p));
+               crc = CRC32_INST(crc,  *(unsigned long *)p);

        for (len %= sizeof(unsigned long); len; len--, p++)
-               asm("crc32b %1, %0" : "+r" (crc) : "rm" (*p));
+               crc = __builtin_ia32_crc32qi(crc, *p);

        return crc;
 }
-- 
2.48.1.711.g2feabab25a-goog
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Dave Hansen 9 months, 3 weeks ago
On 2/26/25 22:12, Bill Wendling wrote:
>  #ifdef CONFIG_X86_64
> -#define CRC32_INST "crc32q %1, %q0"
> +#define CRC32_INST __builtin_ia32_crc32di
>  #else
> -#define CRC32_INST "crc32l %1, %0"
> +#define CRC32_INST __builtin_ia32_crc32si
>  #endif
> 
>  /*
> @@ -78,10 +78,10 @@ u32 crc32c_le_arch(u32 crc, const u8 *p, size_t len)
> 
>         for (num_longs = len / sizeof(unsigned long);
>              num_longs != 0; num_longs--, p += sizeof(unsigned long))
> -               asm(CRC32_INST : "+r" (crc) : "rm" (*(unsigned long *)p));
> +               crc = CRC32_INST(crc,  *(unsigned long *)p);

Could we get rid of the macros, please?

unsigned long crc32_ul(unsigned long crc, unsigned long data)
{
	if (IS_ENABLED(CONFIG_X86_64))
		return __builtin_ia32_crc32di(crc, data);
	else
		return __builtin_ia32_crc32si(crc, data);
}

I guess it could also do some check like:

	if (sizeof(int) == sizeof(long))

instead of CONFIG_X86_64, but the CONFIG_X86_64 will make it more
obvious when someone comes through to rip out 32-bit support some day.
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Thu, Feb 27, 2025 at 8:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> On 2/26/25 22:12, Bill Wendling wrote:
> >  #ifdef CONFIG_X86_64
> > -#define CRC32_INST "crc32q %1, %q0"
> > +#define CRC32_INST __builtin_ia32_crc32di
> >  #else
> > -#define CRC32_INST "crc32l %1, %0"
> > +#define CRC32_INST __builtin_ia32_crc32si
> >  #endif
> >
> >  /*
> > @@ -78,10 +78,10 @@ u32 crc32c_le_arch(u32 crc, const u8 *p, size_t len)
> >
> >         for (num_longs = len / sizeof(unsigned long);
> >              num_longs != 0; num_longs--, p += sizeof(unsigned long))
> > -               asm(CRC32_INST : "+r" (crc) : "rm" (*(unsigned long *)p));
> > +               crc = CRC32_INST(crc,  *(unsigned long *)p);
>
> Could we get rid of the macros, please?
>
> unsigned long crc32_ul(unsigned long crc, unsigned long data)
> {
>         if (IS_ENABLED(CONFIG_X86_64))
>                 return __builtin_ia32_crc32di(crc, data);
>         else
>                 return __builtin_ia32_crc32si(crc, data);
> }
>
> I guess it could also do some check like:
>
>         if (sizeof(int) == sizeof(long))
>
> instead of CONFIG_X86_64, but the CONFIG_X86_64 will make it more
> obvious when someone comes through to rip out 32-bit support some day.

I vastly prefer the first way if made "static __always_inline".
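
Something along these lines, say (untested sketch):

static __always_inline u32 crc32_ul(u32 crc, unsigned long data)
{
	if (IS_ENABLED(CONFIG_X86_64))
		return __builtin_ia32_crc32di(crc, data);
	return __builtin_ia32_crc32si(crc, data);
}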

-bw
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Dave Hansen 9 months, 3 weeks ago
On 2/27/25 12:57, Bill Wendling wrote:
> I vastly prefer the first way if made "static __always_inline".

'static', for sure. But I'd leave the explicit inlining out unless the
compiler is actively being stupid.
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Eric Biggers 9 months, 3 weeks ago
On Wed, Feb 26, 2025 at 10:12:47PM -0800, Bill Wendling wrote:
> For both gcc and clang, crc32 builtins generate better code than the
> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> does the same and unrolls the loops. GCC has no changes on i386, but
> Clang's code generation is vastly improved, due to Clang's "rm"
> constraint issue.
> 
> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> is expected because of the "rm" issue. However, Clang's performance is
> better than GCC's by ~1.5%, most likely due to loop unrolling.
> 
> Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: x86@kernel.org
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Eric Biggers <ebiggers@kernel.org>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Nathan Chancellor <nathan@kernel.org>
> Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
> Cc: Justin Stitt <justinstitt@google.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-crypto@vger.kernel.org
> Cc: llvm@lists.linux.dev
> Signed-off-by: Bill Wendling <morbo@google.com>
> ---
>  arch/x86/Makefile         | 3 +++
>  arch/x86/lib/crc32-glue.c | 8 ++++----
>  2 files changed, 7 insertions(+), 4 deletions(-)

Thanks!  A couple concerns, though:

> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 5b773b34768d..241436da1473 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -114,6 +114,9 @@ else
>  KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
>  endif
> 
> +# Enables the use of CRC32 builtins.
> +KBUILD_CFLAGS += -mcrc32

Doesn't this technically allow the compiler to insert CRC32 instructions
anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
when using intrinsics it's necessary to limit the scope of the feature
enablement to match the runtime CPU feature check that is done, e.g. by using
the target function attribute.
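
For example, only the annotated function would get the ISA extension enabled
(untested sketch, crc32c_u64() is just a made-up name):

	static __attribute__((target("crc32"))) u64 crc32c_u64(u32 crc, u64 data)
	{
		/* The CRC32 ISA is enabled for this function only; the caller
		 * still has to do the runtime CPU feature check. */
		return __builtin_ia32_crc32di(crc, data);
	}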

> diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
> index 2dd18a886ded..fdb94bff25f4 100644
> --- a/arch/x86/lib/crc32-glue.c
> +++ b/arch/x86/lib/crc32-glue.c
> @@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
>  EXPORT_SYMBOL(crc32_le_arch);
> 
>  #ifdef CONFIG_X86_64
> -#define CRC32_INST "crc32q %1, %q0"
> +#define CRC32_INST __builtin_ia32_crc32di
>  #else
> -#define CRC32_INST "crc32l %1, %0"
> +#define CRC32_INST __builtin_ia32_crc32si
>  #endif

Do both gcc and clang consider these builtins to be a stable API, or do they
only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
are actually considered stable.

- Eric
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by H. Peter Anvin 9 months, 3 weeks ago
On February 26, 2025 10:28:59 PM PST, Eric Biggers <ebiggers@kernel.org> wrote:
>On Wed, Feb 26, 2025 at 10:12:47PM -0800, Bill Wendling wrote:
>> For both gcc and clang, crc32 builtins generate better code than the
>> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> does the same and unrolls the loops. GCC has no changes on i386, but
>> Clang's code generation is vastly improved, due to Clang's "rm"
>> constraint issue.
>> 
>> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> is expected because of the "rm" issue. However, Clang's performance is
>> better than GCC's by ~1.5%, most likely due to loop unrolling.
>> 
>> Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: x86@kernel.org
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Eric Biggers <ebiggers@kernel.org>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: Nathan Chancellor <nathan@kernel.org>
>> Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
>> Cc: Justin Stitt <justinstitt@google.com>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-crypto@vger.kernel.org
>> Cc: llvm@lists.linux.dev
>> Signed-off-by: Bill Wendling <morbo@google.com>
>> ---
>>  arch/x86/Makefile         | 3 +++
>>  arch/x86/lib/crc32-glue.c | 8 ++++----
>>  2 files changed, 7 insertions(+), 4 deletions(-)
>
>Thanks!  A couple concerns, though:
>
>> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>> index 5b773b34768d..241436da1473 100644
>> --- a/arch/x86/Makefile
>> +++ b/arch/x86/Makefile
>> @@ -114,6 +114,9 @@ else
>>  KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
>>  endif
>> 
>> +# Enables the use of CRC32 builtins.
>> +KBUILD_CFLAGS += -mcrc32
>
>Doesn't this technically allow the compiler to insert CRC32 instructions
>anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
>when using intrinsics it's necessary to limit the scope of the feature
>enablement to match the runtime CPU feature check that is done, e.g. by using
>the target function attribute.
>
>> diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
>> index 2dd18a886ded..fdb94bff25f4 100644
>> --- a/arch/x86/lib/crc32-glue.c
>> +++ b/arch/x86/lib/crc32-glue.c
>> @@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
>>  EXPORT_SYMBOL(crc32_le_arch);
>> 
>>  #ifdef CONFIG_X86_64
>> -#define CRC32_INST "crc32q %1, %q0"
>> +#define CRC32_INST __builtin_ia32_crc32di
>>  #else
>> -#define CRC32_INST "crc32l %1, %0"
>> +#define CRC32_INST __builtin_ia32_crc32si
>>  #endif
>
>Do both gcc and clang consider these builtins to be a stable API, or do they
>only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
>the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
>are actually considered stable.
>
>- Eric

There is that... also are there compiler versions that we support that do not have -mcrc32 support? 
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Thu, Feb 27, 2025 at 2:53 AM H. Peter Anvin <hpa@zytor.com> wrote:
> On February 26, 2025 10:28:59 PM PST, Eric Biggers <ebiggers@kernel.org> wrote:
> >On Wed, Feb 26, 2025 at 10:12:47PM -0800, Bill Wendling wrote:
> >> For both gcc and clang, crc32 builtins generate better code than the
> >> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> >> does the same and unrolls the loops. GCC has no changes on i386, but
> >> Clang's code generation is vastly improved, due to Clang's "rm"
> >> constraint issue.
> >>
> >> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> >> is expected because of the "rm" issue. However, Clang's performance is
> >> better than GCC's by ~1.5%, most likely due to loop unrolling.
> >>
> >> Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: x86@kernel.org
> >> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >> Cc: Eric Biggers <ebiggers@kernel.org>
> >> Cc: Ard Biesheuvel <ardb@kernel.org>
> >> Cc: Nathan Chancellor <nathan@kernel.org>
> >> Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
> >> Cc: Justin Stitt <justinstitt@google.com>
> >> Cc: linux-kernel@vger.kernel.org
> >> Cc: linux-crypto@vger.kernel.org
> >> Cc: llvm@lists.linux.dev
> >> Signed-off-by: Bill Wendling <morbo@google.com>
> >> ---
> >>  arch/x86/Makefile         | 3 +++
> >>  arch/x86/lib/crc32-glue.c | 8 ++++----
> >>  2 files changed, 7 insertions(+), 4 deletions(-)
> >
> >Thanks!  A couple concerns, though:
> >
> >> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> >> index 5b773b34768d..241436da1473 100644
> >> --- a/arch/x86/Makefile
> >> +++ b/arch/x86/Makefile
> >> @@ -114,6 +114,9 @@ else
> >>  KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
> >>  endif
> >>
> >> +# Enables the use of CRC32 builtins.
> >> +KBUILD_CFLAGS += -mcrc32
> >
> >Doesn't this technically allow the compiler to insert CRC32 instructions
> >anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
> >when using intrinsics it's necessary to limit the scope of the feature
> >enablement to match the runtime CPU feature check that is done, e.g. by using
> >the target function attribute.
> >
> >> diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
> >> index 2dd18a886ded..fdb94bff25f4 100644
> >> --- a/arch/x86/lib/crc32-glue.c
> >> +++ b/arch/x86/lib/crc32-glue.c
> >> @@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
> >>  EXPORT_SYMBOL(crc32_le_arch);
> >>
> >>  #ifdef CONFIG_X86_64
> >> -#define CRC32_INST "crc32q %1, %q0"
> >> +#define CRC32_INST __builtin_ia32_crc32di
> >>  #else
> >> -#define CRC32_INST "crc32l %1, %0"
> >> +#define CRC32_INST __builtin_ia32_crc32si
> >>  #endif
> >
> >Do both gcc and clang consider these builtins to be a stable API, or do they
> >only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
> >the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
> >are actually considered stable.
> >
> >- Eric
>
> There is that... also are there compiler versions that we support that do not have -mcrc32 support?
>
Checking GCC 5.1.0 and Clang 13.0.1, it seems that both support '-mcrc32'.

-bw
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Thu, Feb 27, 2025 at 4:17 AM Bill Wendling <morbo@google.com> wrote:
> On Thu, Feb 27, 2025 at 2:53 AM H. Peter Anvin <hpa@zytor.com> wrote:
> > On February 26, 2025 10:28:59 PM PST, Eric Biggers <ebiggers@kernel.org> wrote:
> > >On Wed, Feb 26, 2025 at 10:12:47PM -0800, Bill Wendling wrote:
> > >> For both gcc and clang, crc32 builtins generate better code than the
> > >> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > >> does the same and unrolls the loops. GCC has no changes on i386, but
> > >> Clang's code generation is vastly improved, due to Clang's "rm"
> > >> constraint issue.
> > >>
> > >> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > >> is expected because of the "rm" issue. However, Clang's performance is
> > >> better than GCC's by ~1.5%, most likely due to loop unrolling.
> > >>
> > >> Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
> > >> Cc: Thomas Gleixner <tglx@linutronix.de>
> > >> Cc: Ingo Molnar <mingo@redhat.com>
> > >> Cc: Borislav Petkov <bp@alien8.de>
> > >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > >> Cc: x86@kernel.org
> > >> Cc: "H. Peter Anvin" <hpa@zytor.com>
> > >> Cc: Eric Biggers <ebiggers@kernel.org>
> > >> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >> Cc: Nathan Chancellor <nathan@kernel.org>
> > >> Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
> > >> Cc: Justin Stitt <justinstitt@google.com>
> > >> Cc: linux-kernel@vger.kernel.org
> > >> Cc: linux-crypto@vger.kernel.org
> > >> Cc: llvm@lists.linux.dev
> > >> Signed-off-by: Bill Wendling <morbo@google.com>
> > >> ---
> > >>  arch/x86/Makefile         | 3 +++
> > >>  arch/x86/lib/crc32-glue.c | 8 ++++----
> > >>  2 files changed, 7 insertions(+), 4 deletions(-)
> > >
> > >Thanks!  A couple concerns, though:
> > >
> > >> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> > >> index 5b773b34768d..241436da1473 100644
> > >> --- a/arch/x86/Makefile
> > >> +++ b/arch/x86/Makefile
> > >> @@ -114,6 +114,9 @@ else
> > >>  KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
> > >>  endif
> > >>
> > >> +# Enables the use of CRC32 builtins.
> > >> +KBUILD_CFLAGS += -mcrc32
> > >
> > >Doesn't this technically allow the compiler to insert CRC32 instructions
> > >anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
> > >when using intrinsics it's necessary to limit the scope of the feature
> > >enablement to match the runtime CPU feature check that is done, e.g. by using
> > >the target function attribute.
> > >
> > >> diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
> > >> index 2dd18a886ded..fdb94bff25f4 100644
> > >> --- a/arch/x86/lib/crc32-glue.c
> > >> +++ b/arch/x86/lib/crc32-glue.c
> > >> @@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
> > >>  EXPORT_SYMBOL(crc32_le_arch);
> > >>
> > >>  #ifdef CONFIG_X86_64
> > >> -#define CRC32_INST "crc32q %1, %q0"
> > >> +#define CRC32_INST __builtin_ia32_crc32di
> > >>  #else
> > >> -#define CRC32_INST "crc32l %1, %0"
> > >> +#define CRC32_INST __builtin_ia32_crc32si
> > >>  #endif
> > >
> > >Do both gcc and clang consider these builtins to be a stable API, or do they
> > >only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
> > >the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
> > >are actually considered stable.
> > >
> > >- Eric
> >
> > There is that... also are there compiler versions that we support that do not have -mcrc32 support?
> >
> Checking GCC 5.1.0 and Clang 13.0.1, it seems that both support '-mcrc32'.
>
I just checked, and GCC 5.1.0 doesn't appear to be able to compile the
kernel anymore, at least not with "defconfig". For one, it doesn't have
retpoline support, and it also can't compile lib/zstd:

lib/zstd/decompress/zstd_decompress_block.c: In function
‘ZSTD_decompressSequences_default’:
lib/zstd/decompress/zstd_decompress_block.c:1539:1: error: inlining
failed in call to always_inline ‘ZSTD_decompressSequences_body’:
optimization level attribute mismatch
 ZSTD_decompressSequences_body(ZSTD_DCtx* dctx,
 ^
lib/zstd/decompress/zstd_decompress_block.c:1633:12: error: called from here
     return ZSTD_decompressSequences_body(dctx, dst, maxDstSize,
seqStart, seqSize, nbSeq, isLongOffset, frame);
            ^

GCC 6.1.0 gets further, but also doesn't have retpoline support. Maybe
the minimum version should be changed?

Anyway, GCC 5.1.0 doesn't support
__attribute__((__target__("crc32"))), so I'd have to use the flag. I
know I can conditionally add the flag with:

CFLAGS_crc32-glue.o := -mcrc32

Like I said, though, the file is compiled twice (why?), but only one of
those builds goes through arch/x86/lib/Makefile. If anyone has any
suggestions on how to solve this, please let me know.

-bw
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Wed, Feb 26, 2025 at 10:29 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Feb 26, 2025 at 10:12:47PM -0800, Bill Wendling wrote:
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
> >
> > Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: x86@kernel.org
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: Eric Biggers <ebiggers@kernel.org>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: Nathan Chancellor <nathan@kernel.org>
> > Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
> > Cc: Justin Stitt <justinstitt@google.com>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-crypto@vger.kernel.org
> > Cc: llvm@lists.linux.dev
> > Signed-off-by: Bill Wendling <morbo@google.com>
> > ---
> >  arch/x86/Makefile         | 3 +++
> >  arch/x86/lib/crc32-glue.c | 8 ++++----
> >  2 files changed, 7 insertions(+), 4 deletions(-)
>
> Thanks!  A couple concerns, though:
>
> > diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> > index 5b773b34768d..241436da1473 100644
> > --- a/arch/x86/Makefile
> > +++ b/arch/x86/Makefile
> > @@ -114,6 +114,9 @@ else
> >  KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
> >  endif
> >
> > +# Enables the use of CRC32 builtins.
> > +KBUILD_CFLAGS += -mcrc32
>
> Doesn't this technically allow the compiler to insert CRC32 instructions
> anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
> when using intrinsics it's necessary to limit the scope of the feature
> enablement to match the runtime CPU feature check that is done, e.g. by using
> the target function attribute.
>
I'm not sure if CRC32 instructions will automatically be inserted when
not explicitly called, especially since the other vector features are
disabled. I wanted to limit enabling this flag for only crc32-glue.c,
but my Makefile-fu failed me. The file appears to be compiled twice.
But adding __attribute__((target("crc32"))) to the function would be
much better.

> > diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
> > index 2dd18a886ded..fdb94bff25f4 100644
> > --- a/arch/x86/lib/crc32-glue.c
> > +++ b/arch/x86/lib/crc32-glue.c
> > @@ -48,9 +48,9 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
> >  EXPORT_SYMBOL(crc32_le_arch);
> >
> >  #ifdef CONFIG_X86_64
> > -#define CRC32_INST "crc32q %1, %q0"
> > +#define CRC32_INST __builtin_ia32_crc32di
> >  #else
> > -#define CRC32_INST "crc32l %1, %0"
> > +#define CRC32_INST __builtin_ia32_crc32si
> >  #endif
>
> Do both gcc and clang consider these builtins to be a stable API, or do they
> only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
> the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
> are actually considered stable.
>
I don't know the answer for this. In general, once we (Clang) create a
__builtin_* function it's not going away, because it will break anyone
who uses them. (I assume the same is true for GCC.) There's a note in
Documentation/arch/x86/x86_64/fsgs.rst in regards to using
_{read,write}fsbase_u64() from immintrin.h (see below). I don't know
if that's analogous to what I'm doing here, but maybe we should do
something similar for crc32intr.h?

FSGSBASE instructions compiler support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE
instructions. Clang 5 supports them as well.

  =================== ===========================
  _readfsbase_u64()   Read the FS base register
  _readgsbase_u64()   Read the GS base register
  _writefsbase_u64()  Write the FS base register
  _writegsbase_u64()  Write the GS base register
  =================== ===========================

To utilize these intrinsics <immintrin.h> must be included in the source
code and the compiler option -mfsgsbase has to be added.


-bw
Re: [PATCH] x86/crc32: use builtins to improve code generation
Posted by Eric Biggers 9 months, 3 weeks ago
On Wed, Feb 26, 2025 at 11:08:22PM -0800, Bill Wendling wrote:
> > Doesn't this technically allow the compiler to insert CRC32 instructions
> > anywhere in arch/x86/ without the needed runtime CPU feature check?  Normally
> > when using intrinsics it's necessary to limit the scope of the feature
> > enablement to match the runtime CPU feature check that is done, e.g. by using
> > the target function attribute.
> >
> I'm not sure if CRC32 instructions will automatically be inserted when
> not explicitly called, especially since the other vector features are
> disabled. I wanted to limit enabling this flag for only crc32-glue.c,
> but my Makefile-fu failed me. The file appears to be compiled twice.
> But adding __attribute__((target("crc32"))) to the function would be
> much better.

Technically, limiting it to crc32-glue.c still isn't enough, as much of the code
in that file is executed before the crc32 instruction support is checked for.

I also noticed that -mcrc32 support wasn't added to clang until clang 14, by
https://github.com/llvm/llvm-project/commit/12fa608af44a80de8b655a8a984cd095908e7e80
But according to https://docs.kernel.org/process/changes.html the minimum clang
version to build Linux is 13.0.1.  So there's a missing check for support.
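
That check could be a cc-option test in the Makefile, or a version guard in
the code itself, roughly (untested, CRC32_BUILTINS_USABLE is a made-up name):

	/* clang didn't accept -mcrc32 until clang 14 */
	#if defined(__clang__) && __clang_major__ < 14
	#define CRC32_BUILTINS_USABLE	0
	#else
	#define CRC32_BUILTINS_USABLE	1
	#endif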

> > Do both gcc and clang consider these builtins to be a stable API, or do they
> > only guarantee the stability of _mm_crc32_*() from immintrin.h?  At least for
> > the rest of the SSE and AVX stuff, I thought that only the immintrin.h functions
> > are actually considered stable.
> >
> I don't know the answer for this. In general, once we (Clang) create a
> __builtin_* function it's not going away, because it will break anyone
> who uses them. (I assume the same is true for GCC.)

Here are examples of LLVM commits that removed x86 builtins:

* https://github.com/llvm/llvm-project/commit/09857a4bd166ca62a9610629731dfbf8f62cd955
* https://github.com/llvm/llvm-project/commit/9a14c369c422b244db78f1a9f947a891a75d912f
* https://github.com/llvm/llvm-project/commit/ec6024d0811b3116e0a29481b01179d5081a3b92
* https://github.com/llvm/llvm-project/commit/e4074432d5bf5c295f96eeed27c5b693f5b3bf16
* https://github.com/llvm/llvm-project/commit/9fddc3fd00b3ad5df5a3988e5cc4708254976173

So no, they do not appear to be considered stable.

(The equivalents in immintrin.h are stable, but good luck including immintrin.h
in the Linux kernel, since it depends on stdlib.h.)

Of course, if we really wanted this we could go with "it works in practice"
anyway.  But, given the small benefit of this patch vs. the potential risk I
don't think we should bother with it, unless it's acked by the gcc and clang
folks on the following points:

* The crc32 builtins are stable.

* gcc and clang will never generate crc32 instructions without explicitly using
  the builtins.  (BTW, keep in mind this ongoing work:
  https://gcc.gnu.org/wiki/cauldron2023talks?action=AttachFile&do=get&target=GCC+CRC+optimization.pdf)

Also note that crc32c_arch() already calls into the assembly code in
arch/x86/lib/crc32c-3way.S to handle lengths >= 512 bytes, and for handling the
tail data that assembly function already has a nice qword-at-a-time loop which
is exactly what we are trying to generate here.  A more promising approach might
be to reorganize things a bit so that we can reuse that assembly code.

- Eric
[PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
For both gcc and clang, crc32 builtins generate better code than the
inline asm. GCC improves, removing unneeded "mov" instructions. Clang
does the same and unrolls the loops. GCC has no changes on i386, but
Clang's code generation is vastly improved, due to Clang's "rm"
constraint issue.

The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
is expected because of the "rm" issue. However, Clang's performance is
better than GCC's by ~1.5%, most likely due to loop unrolling.

Link: https://github.com/llvm/llvm-project/issues/20571#issuecomment-2649330009
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: llvm@lists.linux.dev
Signed-off-by: Bill Wendling <morbo@google.com>
---
v2 - Limited range of '-mcrc32' usage to single file.
   - Use a function instead of macros.
---
 arch/x86/lib/Makefile     |  2 ++
 arch/x86/lib/crc32-glue.c | 15 ++++++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 8a59c61624c2..1251f611ce3d 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -14,6 +14,8 @@ ifdef CONFIG_KCSAN
 CFLAGS_REMOVE_delay.o = $(CC_FLAGS_FTRACE)
 endif

+CFLAGS_crc32-glue.o := -mcrc32
+
 inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
 inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
 quiet_cmd_inat_tables = GEN     $@
diff --git a/arch/x86/lib/crc32-glue.c b/arch/x86/lib/crc32-glue.c
index 2dd18a886ded..fc70462ae2c1 100644
--- a/arch/x86/lib/crc32-glue.c
+++ b/arch/x86/lib/crc32-glue.c
@@ -47,11 +47,12 @@ u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
 }
 EXPORT_SYMBOL(crc32_le_arch);

-#ifdef CONFIG_X86_64
-#define CRC32_INST "crc32q %1, %q0"
-#else
-#define CRC32_INST "crc32l %1, %0"
-#endif
+static unsigned long crc32_ul(u32 crc, unsigned long p)
+{
+       if (IS_ENABLED(CONFIG_X86_64))
+               return __builtin_ia32_crc32di(crc, p);
+       return __builtin_ia32_crc32si(crc, p);
+}

 /*
  * Use carryless multiply version of crc32c when buffer size is >= 512 to
@@ -78,10 +79,10 @@ u32 crc32c_le_arch(u32 crc, const u8 *p, size_t len)

        for (num_longs = len / sizeof(unsigned long);
             num_longs != 0; num_longs--, p += sizeof(unsigned long))
-               asm(CRC32_INST : "+r" (crc) : "rm" (*(unsigned long *)p));
+               crc = crc32_ul(crc,  *(unsigned long *)p);

        for (len %= sizeof(unsigned long); len; len--, p++)
-               asm("crc32b %1, %0" : "+r" (crc) : "rm" (*p));
+               crc = __builtin_ia32_crc32qi(crc, *p);

        return crc;
 }
-- 
2.48.1.711.g2feabab25a-goog
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by David Laight 9 months, 3 weeks ago
On Thu, 27 Feb 2025 15:47:03 -0800
Bill Wendling <morbo@google.com> wrote:

> For both gcc and clang, crc32 builtins generate better code than the
> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> does the same and unrolls the loops. GCC has no changes on i386, but
> Clang's code generation is vastly improved, due to Clang's "rm"
> constraint issue.
> 
> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> is expected because of the "rm" issue. However, Clang's performance is
> better than GCC's by ~1.5%, most likely due to loop unrolling.

How much does it unroll?
How much you need depends on the latency of the crc32 instruction.
The copy of Agner's tables I have gives it a latency of 3 on
pretty much everything.
If you can only do one chained crc instruction every three clocks
it is hard to see how unrolling the loop will help.
Intel cpu (since sandy bridge) will run a two clock loop.
With three clocks to play with it should be easy (even for a compiler)
to generate a loop with no extra clock stalls.

Clearly if Clang decides to copy arguments to the stack an extra time
that will kill things. But in this case you want the "m" constraint
to directly read from the buffer (with a (reg,reg,8) addressing mode).

	David
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Mon, Mar 3, 2025 at 12:15 PM David Laight
<david.laight.linux@gmail.com> wrote:
> On Thu, 27 Feb 2025 15:47:03 -0800
> Bill Wendling <morbo@google.com> wrote:
>
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
>
> How much does it unroll?
> How much you need depends on the latency of the crc32 instruction.
> The copy of Agner's tables I have gives it a latency of 3 on
> pretty much everything.
> If you can only do one chained crc instruction every three clocks
> it is hard to see how unrolling the loop will help.
> Intel cpu (since sandy bridge) will run a two clock loop.
> With three clocks to play with it should be easy (even for a compiler)
> to generate a loop with no extra clock stalls.
>
> Clearly if Clang decides to copy arguments to the stack an extra time
> that will kill things. But in this case you want the "m" constraint
> to directly read from the buffer (with a (reg,reg,8) addressing mode).
>
Below is what Clang generates with the builtins. From what Eric said,
this code is only run for sizes <= 512 bytes? So maybe it's not super
important to micro-optimize this. I apologize, but my ability to
measure clock loops for x86 code isn't great. (I'm sure I lack the
requisite benchmarks, etc.)

-bw

.LBB1_9:                                # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx), %rbx
        addq    $8, %rcx
        incq    %rdi
        cmpq    %rdi, %rsi
        jne     .LBB1_9
# %bb.10:
        subq    %rdi, %rax
        jmp     .LBB1_11
.LBB1_7:
        movq    %r14, %rcx
.LBB1_11:
        movq    %r15, %rsi
        andq    $-8, %rsi
        cmpq    $7, %rdx
        jb      .LBB1_14
# %bb.12:
        xorl    %edx, %edx
.LBB1_13:                               # =>This Inner Loop Header: Depth=1
        movl    %ebx, %ebx
        crc32q  (%rcx,%rdx,8), %rbx
        crc32q  8(%rcx,%rdx,8), %rbx
        crc32q  16(%rcx,%rdx,8), %rbx
        crc32q  24(%rcx,%rdx,8), %rbx
        crc32q  32(%rcx,%rdx,8), %rbx
        crc32q  40(%rcx,%rdx,8), %rbx
        crc32q  48(%rcx,%rdx,8), %rbx
        crc32q  56(%rcx,%rdx,8), %rbx
        addq    $8, %rdx
        cmpq    %rdx, %rax
        jne     .LBB1_13
.LBB1_14:
        addq    %rsi, %r14
.LBB1_15:
        andq    $7, %r15
        je      .LBB1_23
# %bb.16:
        crc32b  (%r14), %ebx
        cmpl    $1, %r15d
        je      .LBB1_23
# %bb.17:
        crc32b  1(%r14), %ebx
        cmpl    $2, %r15d
        je      .LBB1_23
# %bb.18:
        crc32b  2(%r14), %ebx
        cmpl    $3, %r15d
        je      .LBB1_23
# %bb.19:
        crc32b  3(%r14), %ebx
        cmpl    $4, %r15d
        je      .LBB1_23
# %bb.20:
        crc32b  4(%r14), %ebx
        cmpl    $5, %r15d
        je      .LBB1_23
# %bb.21:
        crc32b  5(%r14), %ebx
        cmpl    $6, %r15d
        je      .LBB1_23
# %bb.22:
        crc32b  6(%r14), %ebx
.LBB1_23:
        movl    %ebx, %eax
.LBB1_24:
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by David Laight 9 months, 2 weeks ago
On Mon, 3 Mar 2025 12:27:21 -0800
Bill Wendling <morbo@google.com> wrote:

> On Mon, Mar 3, 2025 at 12:15 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> > On Thu, 27 Feb 2025 15:47:03 -0800
> > Bill Wendling <morbo@google.com> wrote:
> >  
> > > For both gcc and clang, crc32 builtins generate better code than the
> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > > does the same and unrolls the loops. GCC has no changes on i386, but
> > > Clang's code generation is vastly improved, due to Clang's "rm"
> > > constraint issue.
> > >
> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > > is expected because of the "rm" issue. However, Clang's performance is
> > > better than GCC's by ~1.5%, most likely due to loop unrolling.  
> >
> > How much does it unroll?
> > How much you need depends on the latency of the crc32 instruction.
> > The copy of Agner's tables I have gives it a latency of 3 on
> > pretty much everything.
> > If you can only do one chained crc instruction every three clocks
> > it is hard to see how unrolling the loop will help.
> > Intel cpu (since sandy bridge) will run a two clock loop.
> > With three clocks to play with it should be easy (even for a compiler)
> > to generate a loop with no extra clock stalls.
> >
> > Clearly if Clang decides to copy arguments to the stack an extra time
> > that will kill things. But in this case you want the "m" constraint
> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
> >  
> Below is what Clang generates with the builtins. From what Eric said,
> this code is only run for sizes <= 512 bytes? So maybe it's not super
> important to micro-optimize this. I apologize, but my ability to
> measure clock loops for x86 code isn't great. (I'm sure I lack the
> requisite benchmarks, etc.)

Jeepers - that is trashing the I-cache.
Not to mention all the conditional branches at the bottom.
Consider the basic loop:
1:	crc32q	(%rcx), %rbx
	addq	$8, %rcx
	cmp	%rcx, %rdx
	jne	1b
The crc32 has latency 3 so it must take at least 3 clocks.
Even naively the addq can be issued in the same clock as the crc32
and the cmp and jne in the following ones.
Since the jne is predicted taken, the addq can be assumed to execute
in the same clock as the jne.
(The cmp+jne might also get merged into a single u-op)
(I've done this with adc (for IP checksum), with two adc the loop takes
two clocks even with the extra memory reads.)

So that loop is likely to run limited by the three clock latency of crc32.
Even the memory reads will happen with all the crc32 just waiting for the
previous crc32 to finish.
You can take an instruction out of the loop:
1:	crc32q	(%rcx,%rdx), %rbx
	addq	$8, %rdx
	jne	1b
but that may not be necessary, and (IIRC) gcc doesn't like letting you
generate it.

For buffers that aren't multiples of 8 bytes 'remember' that the crc of
a byte depends on how far it is from the end of the buffer, and that initial
zero bytes have no effect.
So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
right by the number of bytes needed to make the rest of the buffer a multiple
of 8 bytes (the same as reading from across the start of the buffer and masking
the low bytes) then treat exactly the same as a buffer that is a multiple
of 8 bytes long.
Don't worry about misaligned reads, you lose less than one clock per cache
line (that is with adc doing a read every clock).

Actually measuring the performance is hard.
You can't just use rdtsc because the clock speed will change when the cpu gets busy.
There is a 'performance counter' that is actual clocks.
While you can use the library functions to set it up, you need to just read the
register - the library overhead is too big.
You also need the odd lfence.
Having done that, and provided the buffer is in the L1 d-cache you can measure
the loop time in clocks and compare against the expected value.
Once you've got 3 clocks per crc32 instruction it won't get any better,
which is why the 'fast' code for big buffers does crc of 3+ buffer sections
in parallel.

	David

> 
> -bw
> 
> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>         movl    %ebx, %ebx
>         crc32q  (%rcx), %rbx
>         addq    $8, %rcx
>         incq    %rdi
>         cmpq    %rdi, %rsi
>         jne     .LBB1_9
> # %bb.10:
>         subq    %rdi, %rax
>         jmp     .LBB1_11
> .LBB1_7:
>         movq    %r14, %rcx
> .LBB1_11:
>         movq    %r15, %rsi
>         andq    $-8, %rsi
>         cmpq    $7, %rdx
>         jb      .LBB1_14
> # %bb.12:
>         xorl    %edx, %edx
> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>         movl    %ebx, %ebx
>         crc32q  (%rcx,%rdx,8), %rbx
>         crc32q  8(%rcx,%rdx,8), %rbx
>         crc32q  16(%rcx,%rdx,8), %rbx
>         crc32q  24(%rcx,%rdx,8), %rbx
>         crc32q  32(%rcx,%rdx,8), %rbx
>         crc32q  40(%rcx,%rdx,8), %rbx
>         crc32q  48(%rcx,%rdx,8), %rbx
>         crc32q  56(%rcx,%rdx,8), %rbx
>         addq    $8, %rdx
>         cmpq    %rdx, %rax
>         jne     .LBB1_13
> .LBB1_14:
>         addq    %rsi, %r14
> .LBB1_15:
>         andq    $7, %r15
>         je      .LBB1_23
> # %bb.16:
>         crc32b  (%r14), %ebx
>         cmpl    $1, %r15d
>         je      .LBB1_23
> # %bb.17:
>         crc32b  1(%r14), %ebx
>         cmpl    $2, %r15d
>         je      .LBB1_23
> # %bb.18:
>         crc32b  2(%r14), %ebx
>         cmpl    $3, %r15d
>         je      .LBB1_23
> # %bb.19:
>         crc32b  3(%r14), %ebx
>         cmpl    $4, %r15d
>         je      .LBB1_23
> # %bb.20:
>         crc32b  4(%r14), %ebx
>         cmpl    $5, %r15d
>         je      .LBB1_23
> # %bb.21:
>         crc32b  5(%r14), %ebx
>         cmpl    $6, %r15d
>         je      .LBB1_23
> # %bb.22:
>         crc32b  6(%r14), %ebx
> .LBB1_23:
>         movl    %ebx, %eax
> .LBB1_24:
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by H. Peter Anvin 9 months, 2 weeks ago
On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@gmail.com> wrote:
>On Mon, 3 Mar 2025 12:27:21 -0800
>Bill Wendling <morbo@google.com> wrote:
>
>> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> <david.laight.linux@gmail.com> wrote:
>> > On Thu, 27 Feb 2025 15:47:03 -0800
>> > Bill Wendling <morbo@google.com> wrote:
>> >  
>> > > For both gcc and clang, crc32 builtins generate better code than the
>> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> > > constraint issue.
>> > >
>> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> > > is expected because of the "rm" issue. However, Clang's performance is
>> > > better than GCC's by ~1.5%, most likely due to loop unrolling.  
>> >
>> > How much does it unroll?
>> > How much you need depends on the latency of the crc32 instruction.
>> > The copy of Agner's tables I have gives it a latency of 3 on
>> > pretty much everything.
>> > If you can only do one chained crc instruction every three clocks
>> > it is hard to see how unrolling the loop will help.
>> > Intel cpu (since sandy bridge) will run a two clock loop.
>> > With three clocks to play with it should be easy (even for a compiler)
>> > to generate a loop with no extra clock stalls.
>> >
>> > Clearly if Clang decides to copy arguments to the stack an extra time
>> > that will kill things. But in this case you want the "m" constraint
>> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >  
>> Below is what Clang generates with the builtins. From what Eric said,
>> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> important to micro-optimize this. I apologize, but my ability to
>> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> requisite benchmarks, etc.)
>
>Jeepers - that is trashing the I-cache.
>Not to mention all the conditional branches at the bottom.
>Consider the basic loop:
>1:	crc32q	(%rcx), %rbx
>	addq	$8, %rcx
>	cmp	%rcx, %rdx
>	jne	1b
>The crc32 has latency 3 so it must take at least 3 clocks.
>Even naively the addq can be issued in the same clock as the crc32
>and the cmp and jne in the following ones.
>Since the jne is predicted taken, the addq can be assumed to execute
>in the same clock as the jne.
>(The cmp+jne might also get merged into a single u-op)
>(I've done this with adc (for IP checksum), with two adc the loop takes
>two clocks even with the extra memory reads.)
>
>So that loop is likely to run limited by the three clock latency of crc32.
>Even the memory reads will happen with all the crc32 just waiting for the
>previous crc32 to finish.
>You can take an instruction out of the loop:
>1:	crc32q	(%rcx,%rdx), %rbx
>	addq	$8, %rdx
>	jne	1b
>but that may not be necessary, and (IIRC) gcc doesn't like letting you
>generate it.
>
>For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>a byte depends on how far it is from the end of the buffer, and that initial
>zero bytes have no effect.
>So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>right by the number of bytes needed to make the rest of the buffer a multiple
>or 8 bytes (the same as reading from across the start of the buffer and masking
>the low bytes) then treat exactly the same as a buffer that is a multiple
>of 8 bytes long.
>Don't worry about misaligned reads, you lose less than one clock per cache
>line (that is with adc doing a read every clock).
>
>Actually measuring the performance is hard.
>You can use rdtsc because the clock speed will change when the cpu gets busy.
>There is a 'performance counter' that is actual clocks.
>While you can use the library functions to set it up, you need to just read the
>register - the library overhead it too big.
>You also need the odd lfence.
>Having done that, and provided the buffer is in the L1 d-cache you can measure
>the loop time in clocks and compare against the expected value.
>Once you've got 3 clocks per crc32 instruction it won't get any better,
>which is why the 'fast' code for big buffers does crc of 3+ buffers sections
>in parallel.
>
>	David
>
>> 
>> -bw
>> 
>> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>>         movl    %ebx, %ebx
>>         crc32q  (%rcx), %rbx
>>         addq    $8, %rcx
>>         incq    %rdi
>>         cmpq    %rdi, %rsi
>>         jne     .LBB1_9
>> # %bb.10:
>>         subq    %rdi, %rax
>>         jmp     .LBB1_11
>> .LBB1_7:
>>         movq    %r14, %rcx
>> .LBB1_11:
>>         movq    %r15, %rsi
>>         andq    $-8, %rsi
>>         cmpq    $7, %rdx
>>         jb      .LBB1_14
>> # %bb.12:
>>         xorl    %edx, %edx
>> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>>         movl    %ebx, %ebx
>>         crc32q  (%rcx,%rdx,8), %rbx
>>         crc32q  8(%rcx,%rdx,8), %rbx
>>         crc32q  16(%rcx,%rdx,8), %rbx
>>         crc32q  24(%rcx,%rdx,8), %rbx
>>         crc32q  32(%rcx,%rdx,8), %rbx
>>         crc32q  40(%rcx,%rdx,8), %rbx
>>         crc32q  48(%rcx,%rdx,8), %rbx
>>         crc32q  56(%rcx,%rdx,8), %rbx
>>         addq    $8, %rdx
>>         cmpq    %rdx, %rax
>>         jne     .LBB1_13
>> .LBB1_14:
>>         addq    %rsi, %r14
>> .LBB1_15:
>>         andq    $7, %r15
>>         je      .LBB1_23
>> # %bb.16:
>>         crc32b  (%r14), %ebx
>>         cmpl    $1, %r15d
>>         je      .LBB1_23
>> # %bb.17:
>>         crc32b  1(%r14), %ebx
>>         cmpl    $2, %r15d
>>         je      .LBB1_23
>> # %bb.18:
>>         crc32b  2(%r14), %ebx
>>         cmpl    $3, %r15d
>>         je      .LBB1_23
>> # %bb.19:
>>         crc32b  3(%r14), %ebx
>>         cmpl    $4, %r15d
>>         je      .LBB1_23
>> # %bb.20:
>>         crc32b  4(%r14), %ebx
>>         cmpl    $5, %r15d
>>         je      .LBB1_23
>> # %bb.21:
>>         crc32b  5(%r14), %ebx
>>         cmpl    $6, %r15d
>>         je      .LBB1_23
>> # %bb.22:
>>         crc32b  6(%r14), %ebx
>> .LBB1_23:
>>         movl    %ebx, %eax
>> .LBB1_24:
>
>

The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?
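
I.e. something like this for the 0..7 leftover bytes, with p/len already past
the qword loop (untested, just to show the shape):

	if (len & 4) {
		crc = __builtin_ia32_crc32si(crc, *(const u32 *)p);
		p += 4;
	}
	if (len & 2) {
		crc = __builtin_ia32_crc32hi(crc, *(const u16 *)p);
		p += 2;
	}
	if (len & 1)
		crc = __builtin_ia32_crc32qi(crc, *p);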
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 2 weeks ago
On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@zytor.com> wrote:
> On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@gmail.com> wrote:
> >On Mon, 3 Mar 2025 12:27:21 -0800
> >Bill Wendling <morbo@google.com> wrote:
> >
> >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
> >> <david.laight.linux@gmail.com> wrote:
> >> > On Thu, 27 Feb 2025 15:47:03 -0800
> >> > Bill Wendling <morbo@google.com> wrote:
> >> >
> >> > > For both gcc and clang, crc32 builtins generate better code than the
> >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> >> > > does the same and unrolls the loops. GCC has no changes on i386, but
> >> > > Clang's code generation is vastly improved, due to Clang's "rm"
> >> > > constraint issue.
> >> > >
> >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> >> > > is expected because of the "rm" issue. However, Clang's performance is
> >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
> >> >
> >> > How much does it unroll?
> >> > How much you need depends on the latency of the crc32 instruction.
> >> > The copy of Agner's tables I have gives it a latency of 3 on
> >> > pretty much everything.
> >> > If you can only do one chained crc instruction every three clocks
> >> > it is hard to see how unrolling the loop will help.
> >> > Intel cpu (since sandy bridge) will run a two clock loop.
> >> > With three clocks to play with it should be easy (even for a compiler)
> >> > to generate a loop with no extra clock stalls.
> >> >
> >> > Clearly if Clang decides to copy arguments to the stack an extra time
> >> > that will kill things. But in this case you want the "m" constraint
> >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
> >> >
> >> Below is what Clang generates with the builtins. From what Eric said,
> >> this code is only run for sizes <= 512 bytes? So maybe it's not super
> >> important to micro-optimize this. I apologize, but my ability to
> >> measure clock loops for x86 code isn't great. (I'm sure I lack the
> >> requisite benchmarks, etc.)
> >
> >Jeepers - that is trashing the I-cache.
> >Not to mention all the conditional branches at the bottom.
> >Consider the basic loop:
> >1:     crc32q  (%rcx), %rbx
> >       addq    $8, %rcx
> >       cmp     %rcx, %rdx
> >       jne     1b
> >The crc32 has latency 3 so it must take at least 3 clocks.
> >Even naively the addq can be issued in the same clock as the crc32
> >and the cmp and jne in the following ones.
> >Since the jne is predicted taken, the addq can be assumed to execute
> >in the same clock as the jne.
> >(The cmp+jne might also get merged into a single u-op)
> >(I've done this with adc (for IP checksum), with two adc the loop takes
> >two clocks even with the extra memory reads.)
> >
> >So that loop is likely to run limited by the three clock latency of crc32.
> >Even the memory reads will happen with all the crc32 just waiting for the
> >previous crc32 to finish.
> >You can take an instruction out of the loop:
> >1:     crc32q  (%rcx,%rdx), %rbx
> >       addq    $8, %rdx
> >       jne     1b
> >but that may not be necessary, and (IIRC) gcc doesn't like letting you
> >generate it.
> >
> >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
> >a byte depends on how far it is from the end of the buffer, and that initial
> >zero bytes have no effect.
> >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
> >right by the number of bytes needed to make the rest of the buffer a multiple
> >or 8 bytes (the same as reading from across the start of the buffer and masking
> >the low bytes) then treat exactly the same as a buffer that is a multiple
> >of 8 bytes long.
> >Don't worry about misaligned reads, you lose less than one clock per cache
> >line (that is with adc doing a read every clock).
> >
For reference, GCC does much better with code gen, but only with the builtin:

.L39:
        crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
        addq    $8, %rax        #, p
        cmpq    %rcx, %rax      # _37, p
        jne     .L39    #,
        leaq    (%rsi,%rdi,8), %rsi     #, p
.L38:
        andl    $7, %edx        #, len
        je      .L41    #,
        addq    %rsi, %rdx      # p, _11
        movl    %ebx, %eax      # crc, <retval>
        .p2align 4
.L40:
        crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
        addq    $1, %rsi        #, p
        cmpq    %rsi, %rdx      # p, _11
        jne     .L40    #,

> >Actually measuring the performance is hard.
> >You can use rdtsc because the clock speed will change when the cpu gets busy.
> >There is a 'performance counter' that is actual clocks.
> >While you can use the library functions to set it up, you need to just read the
> >register - the library overhead it too big.
> >You also need the odd lfence.
> >Having done that, and provided the buffer is in the L1 d-cache you can measure
> >the loop time in clocks and compare against the expected value.
> >Once you've got 3 clocks per crc32 instruction it won't get any better,
> >which is why the 'fast' code for big buffers does crc of 3+ buffers sections
> >in parallel.
> >
Thanks for the info! It'll help a lot the next time I need to delve
deeply into performance.

I tried using rdtsc and another programmatic way of measuring timing.
Also tried making the task have high priority, restricting to one CPU,
etc. But the numbers weren't as consistent as I wanted them to be. The
times I reported were based on the fastest times / clocks /
whatever from several runs for each build.

> >       David
> >
> >>
> >> -bw
> >>
> >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
> >>         movl    %ebx, %ebx
> >>         crc32q  (%rcx), %rbx
> >>         addq    $8, %rcx
> >>         incq    %rdi
> >>         cmpq    %rdi, %rsi
> >>         jne     .LBB1_9
> >> # %bb.10:
> >>         subq    %rdi, %rax
> >>         jmp     .LBB1_11
> >> .LBB1_7:
> >>         movq    %r14, %rcx
> >> .LBB1_11:
> >>         movq    %r15, %rsi
> >>         andq    $-8, %rsi
> >>         cmpq    $7, %rdx
> >>         jb      .LBB1_14
> >> # %bb.12:
> >>         xorl    %edx, %edx
> >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
> >>         movl    %ebx, %ebx
> >>         crc32q  (%rcx,%rdx,8), %rbx
> >>         crc32q  8(%rcx,%rdx,8), %rbx
> >>         crc32q  16(%rcx,%rdx,8), %rbx
> >>         crc32q  24(%rcx,%rdx,8), %rbx
> >>         crc32q  32(%rcx,%rdx,8), %rbx
> >>         crc32q  40(%rcx,%rdx,8), %rbx
> >>         crc32q  48(%rcx,%rdx,8), %rbx
> >>         crc32q  56(%rcx,%rdx,8), %rbx
> >>         addq    $8, %rdx
> >>         cmpq    %rdx, %rax
> >>         jne     .LBB1_13
> >> .LBB1_14:
> >>         addq    %rsi, %r14
> >> .LBB1_15:
> >>         andq    $7, %r15
> >>         je      .LBB1_23
> >> # %bb.16:
> >>         crc32b  (%r14), %ebx
> >>         cmpl    $1, %r15d
> >>         je      .LBB1_23
> >> # %bb.17:
> >>         crc32b  1(%r14), %ebx
> >>         cmpl    $2, %r15d
> >>         je      .LBB1_23
> >> # %bb.18:
> >>         crc32b  2(%r14), %ebx
> >>         cmpl    $3, %r15d
> >>         je      .LBB1_23
> >> # %bb.19:
> >>         crc32b  3(%r14), %ebx
> >>         cmpl    $4, %r15d
> >>         je      .LBB1_23
> >> # %bb.20:
> >>         crc32b  4(%r14), %ebx
> >>         cmpl    $5, %r15d
> >>         je      .LBB1_23
> >> # %bb.21:
> >>         crc32b  5(%r14), %ebx
> >>         cmpl    $6, %r15d
> >>         je      .LBB1_23
> >> # %bb.22:
> >>         crc32b  6(%r14), %ebx
> >> .LBB1_23:
> >>         movl    %ebx, %eax
> >> .LBB1_24:
> >
> >
>
> The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?

Definitely on the weird side! I considered hard-coding something like
that, but thought it might be a bit convoluted, though certainly less
convoluted than what we generate now. A simple loop is probably all
that's needed, because it should only need to be done at most seven
times.

-bw
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by David Laight 9 months, 2 weeks ago
On Mon, 3 Mar 2025 16:16:43 -0800
Bill Wendling <morbo@google.com> wrote:

> On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@gmail.com> wrote:  
> > >On Mon, 3 Mar 2025 12:27:21 -0800
> > >Bill Wendling <morbo@google.com> wrote:
> > >  
> > >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
> > >> <david.laight.linux@gmail.com> wrote:  
> > >> > On Thu, 27 Feb 2025 15:47:03 -0800
> > >> > Bill Wendling <morbo@google.com> wrote:
> > >> >  
> > >> > > For both gcc and clang, crc32 builtins generate better code than the
> > >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > >> > > does the same and unrolls the loops. GCC has no changes on i386, but
> > >> > > Clang's code generation is vastly improved, due to Clang's "rm"
> > >> > > constraint issue.
> > >> > >
> > >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > >> > > is expected because of the "rm" issue. However, Clang's performance is
> > >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.  
> > >> >
> > >> > How much does it unroll?
> > >> > How much you need depends on the latency of the crc32 instruction.
> > >> > The copy of Agner's tables I have gives it a latency of 3 on
> > >> > pretty much everything.
> > >> > If you can only do one chained crc instruction every three clocks
> > >> > it is hard to see how unrolling the loop will help.
> > >> > Intel cpu (since sandy bridge) will run a two clock loop.
> > >> > With three clocks to play with it should be easy (even for a compiler)
> > >> > to generate a loop with no extra clock stalls.
> > >> >
> > >> > Clearly if Clang decides to copy arguments to the stack an extra time
> > >> > that will kill things. But in this case you want the "m" constraint
> > >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
> > >> >  
> > >> Below is what Clang generates with the builtins. From what Eric said,
> > >> this code is only run for sizes <= 512 bytes? So maybe it's not super
> > >> important to micro-optimize this. I apologize, but my ability to
> > >> measure clock loops for x86 code isn't great. (I'm sure I lack the
> > >> requisite benchmarks, etc.)  
> > >
> > >Jeepers - that is trashing the I-cache.
> > >Not to mention all the conditional branches at the bottom.
> > >Consider the basic loop:
> > >1:     crc32q  (%rcx), %rbx
> > >       addq    $8, %rcx
> > >       cmp     %rcx, %rdx
> > >       jne     1b
> > >The crc32 has latency 3 so it must take at least 3 clocks.
> > >Even naively the addq can be issued in the same clock as the crc32
> > >and the cmp and jne in the following ones.
> > >Since the jne is predicted taken, the addq can be assumed to execute
> > >in the same clock as the jne.
> > >(The cmp+jne might also get merged into a single u-op)
> > >(I've done this with adc (for IP checksum), with two adc the loop takes
> > >two clocks even with the extra memory reads.)
> > >
> > >So that loop is likely to run limited by the three clock latency of crc32.
> > >Even the memory reads will happen with all the crc32 just waiting for the
> > >previous crc32 to finish.
> > >You can take an instruction out of the loop:
> > >1:     crc32q  (%rcx,%rdx), %rbx
> > >       addq    $8, %rdx
> > >       jne     1b
> > >but that may not be necessary, and (IIRC) gcc doesn't like letting you
> > >generate it.
> > >
> > >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
> > >a byte depends on how far it is from the end of the buffer, and that initial
> > >zero bytes have no effect.
> > >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
> > >right by the number of bytes needed to make the rest of the buffer a multiple
> > >of 8 bytes (the same as reading from across the start of the buffer and masking
> > >the low bytes) then treat exactly the same as a buffer that is a multiple
> > >of 8 bytes long.
> > >Don't worry about misaligned reads, you lose less than one clock per cache
> > >line (that is with adc doing a read every clock).
> > >  
> For reference, GCC does much better with code gen, but only with the builtin:
> 
> .L39:
>         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
>         addq    $8, %rax        #, p
>         cmpq    %rcx, %rax      # _37, p
>         jne     .L39    #,

That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
then you either need to unroll once (no point doing any more) or use
the loop that does negative offsets from the end.
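
One way to write that negative-offset loop in C is sketched below
(hand-written and untested, using the crc32 builtin and kernel types;
whether the compiler actually keeps the (reg,reg) addressing mode is
another matter):

static u32 crc32c_by_8(u32 crc, const u8 *p, size_t len)
{
        const u8 *end = p + len;
        long i;

        /* len must be a multiple of 8; the index update supplies the
         * loop condition, so no separate cmp is needed */
        for (i = -(long)len; i != 0; i += 8)
                crc = __builtin_ia32_crc32di(crc, *(const u64 *)(end + i));
        return crc;
}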

>         leaq    (%rsi,%rdi,8), %rsi     #, p

That is gcc being brain-dead again.
It pretty much refuses to use a loop-updated pointer (%rax above)
and recalculates it from the count.
At least it is a single instruction here, and the extra register
doesn't cause a spill to the stack.

> .L38:
>         andl    $7, %edx        #, len
>         je      .L41    #,
>         addq    %rsi, %rdx      # p, _11
>         movl    %ebx, %eax      # crc, <retval>
>         .p2align 4
> .L40:
>         crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
>         addq    $1, %rsi        #, p
>         cmpq    %rsi, %rdx      # p, _11
>         jne     .L40    #,
> 
> > >Actually measuring the performance is hard.
> > >You can't use rdtsc because the clock speed will change when the cpu gets busy.
> > >There is a 'performance counter' that is actual clocks.
> > >While you can use the library functions to set it up, you need to just read the
> > >register - the library overhead is too big.
> > >You also need the odd lfence.
> > >Having done that, and provided the buffer is in the L1 d-cache you can measure
> > >the loop time in clocks and compare against the expected value.
> > >Once you've got 3 clocks per crc32 instruction it won't get any better,
> > >which is why the 'fast' code for big buffers does crc of 3+ buffer sections
> > >in parallel.
> > >  
> Thanks for the info! It'll help a lot the next time I need to delve
> deeply into performance.
> 
> I tried using rdtsc and another programmatic way of measuring timing.
> Also tried making the task have high priority, restricting to one CPU,
> etc. But the numbers weren't as consistent as I wanted them to be. The
> times I reported were based on the fastest times / clocks /
> whatever from several runs for each build.

I'll find the code loop I use - machine isn't powered on at the moment.

> 
> > >       David
> > >  
> > >>
> > >> -bw
> > >>
> > >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
> > >>         movl    %ebx, %ebx
> > >>         crc32q  (%rcx), %rbx
> > >>         addq    $8, %rcx
> > >>         incq    %rdi
> > >>         cmpq    %rdi, %rsi
> > >>         jne     .LBB1_9
> > >> # %bb.10:
> > >>         subq    %rdi, %rax
> > >>         jmp     .LBB1_11
> > >> .LBB1_7:
> > >>         movq    %r14, %rcx
> > >> .LBB1_11:
> > >>         movq    %r15, %rsi
> > >>         andq    $-8, %rsi
> > >>         cmpq    $7, %rdx
> > >>         jb      .LBB1_14
> > >> # %bb.12:
> > >>         xorl    %edx, %edx
> > >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
> > >>         movl    %ebx, %ebx
> > >>         crc32q  (%rcx,%rdx,8), %rbx
> > >>         crc32q  8(%rcx,%rdx,8), %rbx
> > >>         crc32q  16(%rcx,%rdx,8), %rbx
> > >>         crc32q  24(%rcx,%rdx,8), %rbx
> > >>         crc32q  32(%rcx,%rdx,8), %rbx
> > >>         crc32q  40(%rcx,%rdx,8), %rbx
> > >>         crc32q  48(%rcx,%rdx,8), %rbx
> > >>         crc32q  56(%rcx,%rdx,8), %rbx
> > >>         addq    $8, %rdx
> > >>         cmpq    %rdx, %rax
> > >>         jne     .LBB1_13
> > >> .LBB1_14:
> > >>         addq    %rsi, %r14
> > >> .LBB1_15:
> > >>         andq    $7, %r15
> > >>         je      .LBB1_23
> > >> # %bb.16:
> > >>         crc32b  (%r14), %ebx
> > >>         cmpl    $1, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.17:
> > >>         crc32b  1(%r14), %ebx
> > >>         cmpl    $2, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.18:
> > >>         crc32b  2(%r14), %ebx
> > >>         cmpl    $3, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.19:
> > >>         crc32b  3(%r14), %ebx
> > >>         cmpl    $4, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.20:
> > >>         crc32b  4(%r14), %ebx
> > >>         cmpl    $5, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.21:
> > >>         crc32b  5(%r14), %ebx
> > >>         cmpl    $6, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.22:
> > >>         crc32b  6(%r14), %ebx
> > >> .LBB1_23:
> > >>         movl    %ebx, %eax
> > >> .LBB1_24:  
> > >
> > >  
> >
> > The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?

Well, provided the branches aren't mispredicted it'll be limited by
the crc32b - so three clocks per byte, max 27.
The 4-2-1 stepdown needs the extra address update, but that may not cost
anything, and it is then max 9 clocks. Also a lot less I-cache.
The code logic may not matter unless the buffer is short.
I think the cpu will be executing the tail instructions while many
of the crc32 from the main loop are still queued waiting for results
from earlier instructions (especially if you get a loop that would
run in two clocks with (say) addq instead of crc32q).
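
For reference, a 4-2-1 tail in C would look something like the untested
sketch below (builtins used just to show the shape; an inline-asm version
would be the same three steps):

static u32 crc32c_tail(u32 crc, const u8 *p, size_t len)
{
        /* 0..7 leftover bytes: at most three crc32 instructions and
         * three easily predicted branches, no byte loop */
        if (len & 4) {
                crc = __builtin_ia32_crc32si(crc, *(const u32 *)p);
                p += 4;
        }
        if (len & 2) {
                crc = __builtin_ia32_crc32hi(crc, *(const u16 *)p);
                p += 2;
        }
        if (len & 1)
                crc = __builtin_ia32_crc32qi(crc, *p);
        return crc;
}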

> Definitely on the weird side! I considered hard-coding something like
> that, but thought it might be a bit convoluted, though certainly less
> convoluted than what we generate now. A simple loop is probably all
> that's needed, because it should only need to be done at most seven
> times.

The byte loop should be limited by the crc32b. So probably as fast
as that unrolled mess, although it will always have a mispredicted
branch (or two) - I suspect all loops do.

	David
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by David Laight 9 months, 2 weeks ago
On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:

....
> > For reference, GCC does much better with code gen, but only with the builtin:
> > 
> > .L39:
> >         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
> >         addq    $8, %rax        #, p
> >         cmpq    %rcx, %rax      # _37, p
> >         jne     .L39    #,  
> 
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.

Thinking about it while properly awake, the 1% difference isn't going to be
the difference between the above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks; if the above is slower
it'll be doing 8 bytes in 4 clocks (ok, you can get 3.5 - but unlikely),
which would be either 25% or 33% depending on which way you measure it.
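
(Spelling out the arithmetic: at 3 clocks per 8 bytes versus 4 clocks per
8 bytes, the slower loop takes 4/3 - 1, i.e. about 33% more clocks, or the
faster one takes 1 - 3/4 = 25% fewer, depending on which you use as the
baseline.)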

...
> I'll find the code loop I use - machine isn't powered on at the moment.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;
static void init_pmc(void)
{
        static struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .pinned = 1,
        };
        struct perf_event_mmap_page *pc;

        int perf_fd;
        perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
        if (perf_fd < 0) {
                fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
                exit(1);
        }
        pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
        if (pc == MAP_FAILED) {
                fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
                exit(1);
        }
        pmc_id = pc->index - 1;
}

static inline unsigned int rdpmc(int id)
{
        unsigned int low, high;

// You need something to force the instruction pipeline to finish.
// lfence might be enough.
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
        asm volatile("mfence");
#endif

        // return low bits; the counter might be 32 or 40 bits wide.
        return low;
}

The test code is then something like:
#define PASSES 10
        unsigned int ticks[PASSES];
        unsigned int tick;
        unsigned int i;

        for (i = 0; i < PASSES; i++) {
                tick = rdpmc(pmc_id);
                test_fn(buf, len);
                ticks[i] = rdpmc(pmc_id) - tick;
        }

        for (i = 0; i < PASSES; i++)
                printf(" %5d", ticks[i]);

Make sure the data is in the l1-cache (or that dominates).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.
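
Putting the pieces together, a minimal harness dropped into the same file
as the functions above could look like the sketch below (illustration
only; test_fn, sink and BUF_LEN are placeholder names, test_fn is a
stand-in 8-bytes-per-crc32q loop, and the build needs a compiler option
that enables the crc32 builtins, e.g. -msse4.2 or -mcrc32):

#define PASSES 10
#define BUF_LEN 512

static unsigned char buf[BUF_LEN];
static volatile unsigned int sink;      /* keeps test_fn from being optimised away */

static unsigned int test_fn(const unsigned char *p, size_t len)
{
        unsigned int crc = ~0u;

        for (; len >= 8; len -= 8, p += 8)
                crc = __builtin_ia32_crc32di(crc, *(const unsigned long long *)p);
        return crc;
}

int main(void)
{
        unsigned int ticks[PASSES];
        unsigned int tick;
        unsigned int i;

        init_pmc();
        sink = test_fn(buf, BUF_LEN);   /* warm the L1 d-cache */

        for (i = 0; i < PASSES; i++) {
                tick = rdpmc(pmc_id);
                sink = test_fn(buf, BUF_LEN);
                ticks[i] = rdpmc(pmc_id) - tick;
        }

        for (i = 0; i < PASSES; i++)
                printf(" %5u", ticks[i]);
        printf("\n");
        return 0;
}

The clocks-per-iteration figure then falls out of two runs with different
lengths, e.g. (ticks(512) - ticks(256)) / 32, which also cancels the fixed
call and fence overhead.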

I did find that sometimes running the program gave slow results.
But it is usually very consistent.
Needs to be run as root.
Clearly a hardware interrupt will generate a very big number.
But they don't happen.

The copy I found was used for measuring ip checksum algorithms.
Seems to output:
$ sudo ./ipcsum 
                0     0   160   160   160   160   160   160   160   160   160   160  overhead
 3637b4f0b942c3c4  682f   316    25    26    26    26    26    26    26    26    26  csum_partial
 3637b4f0b942c3c4  682f   124    79    43    25    25    25    24    26    25    24  csum_partial_1
 3637b4f0b942c3c4  682f   166    43    25    25    24    24    24    24    24    24  csum_new adc pair
 3637b4f0b942c3c4  682f   115    21    21    21    21    21    21    21    21    21  adc_dec_2
 3637b4f0b942c3c4  682f    97    34    31    23    24    24    24    24    24    23  adc_dec_4
 3637b4f0b942c3c4  682f    39    33    34    21    21    21    21    21    21    21  adc_dec_8
 3637b4f0b942c3c4  682f    81    52    49    52    49    26    25    27    25    26  adc_jcxz_2
 3637b4f0b942c3c4  682f    62    46    24    24    24    24    24    24    24    24  adc_jcxz_4
 3637b4f0b942c3c4  682f   224    40    21    21    23    23    23    23    23    23  adc_2_pair
 3637b4f0b942c3c4  682f    42    36    37    22    22    22    22    22    22    22  adc_4_pair_old
 3637b4f0b942c3c4  682f    42    37    34    41    23    23    23    23    23    23  adc_4_pair
 3637b4f0b942c3c4  682f   122    19    20    19    18    19    18    19    18    19  adcx_adox
        bef7a78a9  682f   104    51    30    30    30    30    30    30    30    30  add_c_16
        bef7a78a9  682f   143    50    50    27    27    27    27    27    27    27  add_c_32
        6ef7a78ae  682f   103    91    45    34    34    34    35    34    34    34  add_c_high

I don't think the current one is in there - IIRC it is as fast as the adcx_adox one
but more portable.


	David
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Eric Biggers 9 months, 2 weeks ago
On Tue, Mar 04, 2025 at 08:52:52PM +0000, David Laight wrote:
> On Tue, 4 Mar 2025 04:32:23 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> ....
> > > For reference, GCC does much better with code gen, but only with the builtin:
> > > 
> > > .L39:
> > >         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
> > >         addq    $8, %rax        #, p
> > >         cmpq    %rcx, %rax      # _37, p
> > >         jne     .L39    #,  
> > 
> > That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> > then you either need to unroll once (no point doing any more) or use
> > the loop that does negative offsets from the end.
> 
> Thinking about it while properly awake, the 1% difference isn't going to be
> the difference between the above and Clang's unrolled loop.
> Clang's loop will do 8 bytes every three clocks; if the above is slower
> it'll be doing 8 bytes in 4 clocks (ok, you can get 3.5 - but unlikely),
> which would be either 25% or 33% depending on which way you measure it.
> 
> ...
> > I'll find the code loop I use - machine isn't powered on at the moment.
> 
> #include <linux/perf_event.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> 
> static int pmc_id;
> static void init_pmc(void)
> {
>         static struct perf_event_attr perf_attr = {
>                 .type = PERF_TYPE_HARDWARE,
>                 .config = PERF_COUNT_HW_CPU_CYCLES,
>                 .pinned = 1,
>         };
>         struct perf_event_mmap_page *pc;
> 
>         int perf_fd;
>         perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
>         if (perf_fd < 0) {
>                 fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
>                 exit(1);
>         }
>         pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
>         if (pc == MAP_FAILED) {
>                 fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
>                 exit(1);
>         }
>         pmc_id = pc->index - 1;
> }
> 
> static inline unsigned int rdpmc(int id)
> {
>         unsigned int low, high;
> 
> // You need something to force the instruction pipeline to finish.
> // lfence might be enough.
> #ifndef NOFENCE
>         asm volatile("mfence");
> #endif
>         asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
> #ifndef NOFENCE
>         asm volatile("mfence");
> #endif
> 
>         // return low bits; the counter might be 32 or 40 bits wide.
>         return low;
> }
> 
> The test code is then something like:
> #define PASSES 10
>         unsigned int ticks[PASSES];
>         unsigned int tick;
>         unsigned int i;
> 
>         for (i = 0; i < PASSES; i++) {
>                 tick = rdpmc(pmc_id);
>                 test_fn(buf, len);
>                 ticks[i] = rdpmc(pmc_id) - tick;
>         }
> 
>         for (i = 0; i < PASSES; i++)
>                 printf(" %5d", ticks[i]);
> 
> Make sure the data is in the l1-cache (or that dominates).
> The values output for passes 2-10 are likely to be the same to within
> a clock or two.
> I probably tried to subtract an offset for an empty test_fn().
> But you can easily work out the 'clocks per loop iteration'
> (which is what you are trying to measure) by measuring two separate
> loop lengths.
> 
> I did find that sometimes running the program gave slow results.
> But it is usually very consistent.
> Needs to be run as root.
> Clearly a hardware interrupt will generate a very big number.
> But they don't happen.
> 
> The copy I found was used for measuring ip checksum algorithms.
> Seems to output:
> $ sudo ./ipcsum 
>                 0     0   160   160   160   160   160   160   160   160   160   160  overhead
>  3637b4f0b942c3c4  682f   316    25    26    26    26    26    26    26    26    26  csum_partial
>  3637b4f0b942c3c4  682f   124    79    43    25    25    25    24    26    25    24  csum_partial_1
>  3637b4f0b942c3c4  682f   166    43    25    25    24    24    24    24    24    24  csum_new adc pair
>  3637b4f0b942c3c4  682f   115    21    21    21    21    21    21    21    21    21  adc_dec_2
>  3637b4f0b942c3c4  682f    97    34    31    23    24    24    24    24    24    23  adc_dec_4
>  3637b4f0b942c3c4  682f    39    33    34    21    21    21    21    21    21    21  adc_dec_8
>  3637b4f0b942c3c4  682f    81    52    49    52    49    26    25    27    25    26  adc_jcxz_2
>  3637b4f0b942c3c4  682f    62    46    24    24    24    24    24    24    24    24  adc_jcxz_4
>  3637b4f0b942c3c4  682f   224    40    21    21    23    23    23    23    23    23  adc_2_pair
>  3637b4f0b942c3c4  682f    42    36    37    22    22    22    22    22    22    22  adc_4_pair_old
>  3637b4f0b942c3c4  682f    42    37    34    41    23    23    23    23    23    23  adc_4_pair
>  3637b4f0b942c3c4  682f   122    19    20    19    18    19    18    19    18    19  adcx_adox
>         bef7a78a9  682f   104    51    30    30    30    30    30    30    30    30  add_c_16
>         bef7a78a9  682f   143    50    50    27    27    27    27    27    27    27  add_c_32
>         6ef7a78ae  682f   103    91    45    34    34    34    35    34    34    34  add_c_high
> 
> I don't think the current one is in there - IIRC it is as fast as the adcx_adox one
> but more portable.

I guess this thread has turned into one where everyone has to weigh in :-)

Just to summarize my thoughts on the whole thread:

- IMO we should not use the crc32 intrinsics yet, as there are too many issues
  including no stability guarantee for the builtins (or else having to figure
  out how to include immintrin.h in the kernel to get the stable functions),
  having to set the crc32 target with the correct scope, dealing with old
  compiler versions that don't support crc32, and unhelpful loop unrolling.

- https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@kernel.org already
  fixed the spilling to the stack with clang.  It does result in a separate mov
  from memory instead of taking advantage of the mem operand support.  But that
  should not make much of a difference.

- crc_kunit already includes a benchmark.  I recommend using that for
  benchmarking the kernel's CRC code.  Sure, one can do a more precise analysis
  with performance counters, but IMO it's generally unnecessary.

- The 4-2-1 step-down is a good idea, and in fact crc32c-3way.S (which handles
  lengths >= 512 bytes) already does exactly that for tail handling.  I sent out
  https://lore.kernel.org/r/20250304213216.108925-1-ebiggers@kernel.org which
  adds it to the C code (which handles lengths < 512 bytes) too.

- Moving all this to assembly is still attractive, especially considering that
  lengths >= 512 bytes are already handled in assembly in crc32c-3way.S, and
  essentially the exact code we want is already in that file (it's used to
  handle anything left over from the 3-way processing).  But, I think we'll keep
  the C (with inline asm for just the crc32 instructions) version too for now.
  It's a bit more approachable, and it's nice to avoid an extra function call to
  a .S file.
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by H. Peter Anvin 9 months, 2 weeks ago
On March 3, 2025 4:16:43 PM PST, Bill Wendling <morbo@google.com> wrote:
>On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@zytor.com> wrote:
>> On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@gmail.com> wrote:
>> >On Mon, 3 Mar 2025 12:27:21 -0800
>> >Bill Wendling <morbo@google.com> wrote:
>> >
>> >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
>> >> <david.laight.linux@gmail.com> wrote:
>> >> > On Thu, 27 Feb 2025 15:47:03 -0800
>> >> > Bill Wendling <morbo@google.com> wrote:
>> >> >
>> >> > > For both gcc and clang, crc32 builtins generate better code than the
>> >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
>> >> > > does the same and unrolls the loops. GCC has no changes on i386, but
>> >> > > Clang's code generation is vastly improved, due to Clang's "rm"
>> >> > > constraint issue.
>> >> > >
>> >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
>> >> > > is expected because of the "rm" issue. However, Clang's performance is
>> >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.
>> >> >
>> >> > How much does it unroll?
>> >> > How much you need depends on the latency of the crc32 instruction.
>> >> > The copy of Agner's tables I have gives it a latency of 3 on
>> >> > pretty much everything.
>> >> > If you can only do one chained crc instruction every three clocks
>> >> > it is hard to see how unrolling the loop will help.
>> >> > Intel cpu (since sandy bridge) will run a two clock loop.
>> >> > With three clocks to play with it should be easy (even for a compiler)
>> >> > to generate a loop with no extra clock stalls.
>> >> >
>> >> > Clearly if Clang decides to copy arguments to the stack an extra time
>> >> > that will kill things. But in this case you want the "m" constraint
>> >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
>> >> >
>> >> Below is what Clang generates with the builtins. From what Eric said,
>> >> this code is only run for sizes <= 512 bytes? So maybe it's not super
>> >> important to micro-optimize this. I apologize, but my ability to
>> >> measure clock loops for x86 code isn't great. (I'm sure I lack the
>> >> requisite benchmarks, etc.)
>> >
>> >Jeepers - that is trashing the I-cache.
>> >Not to mention all the conditional branches at the bottom.
>> >Consider the basic loop:
>> >1:     crc32q  (%rcx), %rbx
>> >       addq    $8, %rcx
>> >       cmp     %rcx, %rdx
>> >       jne     1b
>> >The crc32 has latency 3 so it must take at least 3 clocks.
>> >Even naively the addq can be issued in the same clock as the crc32
>> >and the cmp and jne in the following ones.
>> >Since the jne is predicted taken, the addq can be assumed to execute
>> >in the same clock as the jne.
>> >(The cmp+jne might also get merged into a single u-op)
>> >(I've done this with adc (for IP checksum), with two adc the loop takes
>> >two clocks even with the extra memory reads.)
>> >
>> >So that loop is likely to run limited by the three clock latency of crc32.
>> >Even the memory reads will happen with all the crc32 just waiting for the
>> >previous crc32 to finish.
>> >You can take an instruction out of the loop:
>> >1:     crc32q  (%rcx,%rdx), %rbx
>> >       addq    $8, %rdx
>> >       jne     1b
>> >but that may not be necessary, and (IIRC) gcc doesn't like letting you
>> >generate it.
>> >
>> >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
>> >a byte depends on how far it is from the end of the buffer, and that initial
>> >zero bytes have no effect.
>> >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
>> >right by the number of bytes needed to make the rest of the buffer a multiple
>> >of 8 bytes (the same as reading from across the start of the buffer and masking
>> >the low bytes) then treat exactly the same as a buffer that is a multiple
>> >of 8 bytes long.
>> >Don't worry about misaligned reads, you lose less than one clock per cache
>> >line (that is with adc doing a read every clock).
>> >
>For reference, GCC does much better with code gen, but only with the builtin:
>
>.L39:
>        crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
>        addq    $8, %rax        #, p
>        cmpq    %rcx, %rax      # _37, p
>        jne     .L39    #,
>        leaq    (%rsi,%rdi,8), %rsi     #, p
>.L38:
>        andl    $7, %edx        #, len
>        je      .L41    #,
>        addq    %rsi, %rdx      # p, _11
>        movl    %ebx, %eax      # crc, <retval>
>        .p2align 4
>.L40:
>        crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
>        addq    $1, %rsi        #, p
>        cmpq    %rsi, %rdx      # p, _11
>        jne     .L40    #,
>
>> >Actually measuring the performance is hard.
>> >You can't use rdtsc because the clock speed will change when the cpu gets busy.
>> >There is a 'performance counter' that is actual clocks.
>> >While you can use the library functions to set it up, you need to just read the
>> >register - the library overhead is too big.
>> >You also need the odd lfence.
>> >Having done that, and provided the buffer is in the L1 d-cache you can measure
>> >the loop time in clocks and compare against the expected value.
>> >Once you've got 3 clocks per crc32 instruction it won't get any better,
>> >which is why the 'fast' code for big buffers does crc of 3+ buffer sections
>> >in parallel.
>> >
>Thanks for the info! It'll help a lot the next time I need to delve
>deeply into performance.
>
>I tried using rdtsc and another programmatic way of measuring timing.
>Also tried making the task have high priority, restricting to one CPU,
>etc. But the numbers weren't as consistent as I wanted them to be. The
>times I reported were based on the fastest times / clocks /
>whatever from several runs for each build.
>
>> >       David
>> >
>> >>
>> >> -bw
>> >>
>> >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx), %rbx
>> >>         addq    $8, %rcx
>> >>         incq    %rdi
>> >>         cmpq    %rdi, %rsi
>> >>         jne     .LBB1_9
>> >> # %bb.10:
>> >>         subq    %rdi, %rax
>> >>         jmp     .LBB1_11
>> >> .LBB1_7:
>> >>         movq    %r14, %rcx
>> >> .LBB1_11:
>> >>         movq    %r15, %rsi
>> >>         andq    $-8, %rsi
>> >>         cmpq    $7, %rdx
>> >>         jb      .LBB1_14
>> >> # %bb.12:
>> >>         xorl    %edx, %edx
>> >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
>> >>         movl    %ebx, %ebx
>> >>         crc32q  (%rcx,%rdx,8), %rbx
>> >>         crc32q  8(%rcx,%rdx,8), %rbx
>> >>         crc32q  16(%rcx,%rdx,8), %rbx
>> >>         crc32q  24(%rcx,%rdx,8), %rbx
>> >>         crc32q  32(%rcx,%rdx,8), %rbx
>> >>         crc32q  40(%rcx,%rdx,8), %rbx
>> >>         crc32q  48(%rcx,%rdx,8), %rbx
>> >>         crc32q  56(%rcx,%rdx,8), %rbx
>> >>         addq    $8, %rdx
>> >>         cmpq    %rdx, %rax
>> >>         jne     .LBB1_13
>> >> .LBB1_14:
>> >>         addq    %rsi, %r14
>> >> .LBB1_15:
>> >>         andq    $7, %r15
>> >>         je      .LBB1_23
>> >> # %bb.16:
>> >>         crc32b  (%r14), %ebx
>> >>         cmpl    $1, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.17:
>> >>         crc32b  1(%r14), %ebx
>> >>         cmpl    $2, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.18:
>> >>         crc32b  2(%r14), %ebx
>> >>         cmpl    $3, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.19:
>> >>         crc32b  3(%r14), %ebx
>> >>         cmpl    $4, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.20:
>> >>         crc32b  4(%r14), %ebx
>> >>         cmpl    $5, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.21:
>> >>         crc32b  5(%r14), %ebx
>> >>         cmpl    $6, %r15d
>> >>         je      .LBB1_23
>> >> # %bb.22:
>> >>         crc32b  6(%r14), %ebx
>> >> .LBB1_23:
>> >>         movl    %ebx, %eax
>> >> .LBB1_24:
>> >
>> >
>>
>> The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?
>
>Definitely on the weird side! I considered hard-coding something like
>that, but thought it might be a bit convoluted, though certainly less
>convoluted than what we generate now. A simple loop is probably all
>that's needed, because it should only need to be done at most seven
>times.
>
>-bw
>

4-2-1 makes more sense probably (4 bytes, then 2 bytes, then 1 byte, depending on which bits are set).
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Eric Biggers 9 months, 3 weeks ago
On Thu, Feb 27, 2025 at 03:47:03PM -0800, Bill Wendling wrote:
> For both gcc and clang, crc32 builtins generate better code than the
> inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> does the same and unrolls the loops. GCC has no changes on i386, but
> Clang's code generation is vastly improved, due to Clang's "rm"
> constraint issue.
> 
> The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> is expected because of the "rm" issue. However, Clang's performance is
> better than GCC's by ~1.5%, most likely due to loop unrolling.

Also note that the patch
https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@kernel.org/ (which is
already enqueued in the crc tree for 6.15) changes "rm" to "r" when the compiler
is clang, to improve clang's code generation.  The numbers you quote are against
the original version, right?

- Eric
Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Posted by Bill Wendling 9 months, 3 weeks ago
On Fri, Feb 28, 2025 at 1:20 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Thu, Feb 27, 2025 at 03:47:03PM -0800, Bill Wendling wrote:
> > For both gcc and clang, crc32 builtins generate better code than the
> > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > does the same and unrolls the loops. GCC has no changes on i386, but
> > Clang's code generation is vastly improved, due to Clang's "rm"
> > constraint issue.
> >
> > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > is expected because of the "rm" issue. However, Clang's performance is
> > better than GCC's by ~1.5%, most likely due to loop unrolling.
>
> Also note that the patch
> https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@kernel.org/ (which is
> already enqueued in the crc tree for 6.15) changes "rm" to "r" when the compiler
> is clang, to improve clang's code generation.  The numbers you quote are against
> the original version, right?
>
Yeah, they were against top-of-tree.

-bw