Adding __popcountsi2 and __popcountdi2

Nathan Chancellor posted 1 patch 9 months, 2 weeks ago
Adding __popcountsi2 and __popcountdi2
Posted by Nathan Chancellor 9 months, 2 weeks ago
Hi Linus,

Since I ran into problems at pull request time previously, I figured I
would save myself some trouble and gauge your opinion up front. How
palatable would the diff at the end of the thread be for the kernel?
Clang would like to start emitting calls to __popcountsi2 and
__popcountdi2 [1] for certain architectures (ARM and RISC-V), which
would normally be a part of the compiler runtime but obviously the
kernel does not link against it so it breaks the build. I figured added
these may not be as bad as the wcslen() case because most architectures
generally have an optimized popcount implementation and I am not sure
compiler builtins are banned entirely from the kernel but I can
understand if it is still contentious. It sounds like GCC has previously
wanted to something similar [2] and it was somewhat brought up on the
mailing lists [3] but never persued further it seems. Since this is a
compiler runtime function, '-fno-builtin' would not work to avoid this.

Cheers,
Nathan

[1]: https://github.com/llvm/llvm-project/pull/101786
[2]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105253#c10
[3]: https://lore.kernel.org/YlSb5D3rDTyCWpay@tucnak/

diff --git a/lib/Makefile b/lib/Makefile
index f07b24ce1b3f..0240fa7d6b5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -52,7 +52,7 @@ obj-y	+= lockref.o
 
 obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
-	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
+	 list_sort.o uuid.o iov_iter.o clz_ctz.o popcount.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
diff --git a/lib/popcount.c b/lib/popcount.c
new file mode 100644
index 000000000000..0234961cc35e
--- /dev/null
+++ b/lib/popcount.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * The functions in this file aren't called directly, but may be emitted
+ * by the compiler.
+ */
+
+#include <linux/bitops.h>
+#include <linux/export.h>
+
+int __popcountsi2(unsigned int val);
+int __popcountsi2(unsigned int val)
+{
+	return __arch_hweight32(val);
+}
+EXPORT_SYMBOL(__popcountsi2);
+
+int __popcountdi2(unsigned long long val);
+int __popcountdi2(unsigned long long val)
+{
+	return __arch_hweight64(val);
+}
+EXPORT_SYMBOL(__popcountdi2);
Re: Adding __popcountsi2 and __popcountdi2
Posted by David Laight 9 months, 1 week ago
On Thu, 24 Apr 2025 17:33:42 -0700
Nathan Chancellor <nathan@kernel.org> wrote:

> Hi Linus,
> 
> Since I ran into problems at pull request time previously, I figured I
> would save myself some trouble and gauge your opinion up front. How
> palatable would the diff at the end of the thread be for the kernel?
> Clang would like to start emitting calls to __popcountsi2 and
> __popcountdi2 [1] for certain architectures (ARM and RISC-V), which
> would normally be a part of the compiler runtime but obviously the
> kernel does not link against it so it breaks the build. I figured added
> these may not be as bad as the wcslen() case because most architectures
> generally have an optimized popcount implementation and I am not sure
> compiler builtins are banned entirely from the kernel but I can
> understand if it is still contentious. It sounds like GCC has previously
> wanted to something similar [2] and it was somewhat brought up on the
> mailing lists [3] but never persued further it seems. Since this is a
> compiler runtime function, '-fno-builtin' would not work to avoid this.

Is this the compiler converting a call to __builtin_popcount() into
a function call - which the kernel can arrange to never do.

Or the compiler detecting a code pattern that looks like an open-coded
'popcount' function and deciding to convert it to a call to the builtin?
(which is a translation the kernel pretty much never wants for any
such code pattern - including memcpy()).

In either case the link failure is exactly what you want.

	David
Re: Adding __popcountsi2 and __popcountdi2
Posted by Linus Torvalds 9 months, 2 weeks ago
On Thu, 24 Apr 2025 at 17:33, Nathan Chancellor <nathan@kernel.org> wrote:
>
> I figured added
> these may not be as bad as the wcslen() case because most architectures
> generally have an optimized popcount implementation and I am not sure
> compiler builtins are banned entirely from the kernel but I can
> understand if it is still contentious.

Why does the compiler even bother to do this if the architecture
doesn't have the popcount instruction? The function call is quite
possibly more expensive than just doing it the stupid way.

But if you want to do this, put the damn thing as an alias on the code
that actually *does* the SW fallback in lib/hweight.c.

Because the way your patch does it now, it takes "I'm doing stupid
things" to the next level by turning that function call into *two*
function calls - first calling __popcountsi2, which then calls
__sw_hweight32.

Let's not do stupid things, ok?

              Linus
Re: Adding __popcountsi2 and __popcountdi2
Posted by Nathan Chancellor 9 months, 2 weeks ago
On Thu, Apr 24, 2025 at 06:36:33PM -0700, Linus Torvalds wrote:
> On Thu, 24 Apr 2025 at 17:33, Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > I figured added
> > these may not be as bad as the wcslen() case because most architectures
> > generally have an optimized popcount implementation and I am not sure
> > compiler builtins are banned entirely from the kernel but I can
> > understand if it is still contentious.
> 
> Why does the compiler even bother to do this if the architecture
> doesn't have the popcount instruction? The function call is quite
> possibly more expensive than just doing it the stupid way.

Not entirely sure what the motivation is from the compiler side but I
cannot immagine that they would be doing this if it was not more
efficient in some way.

> But if you want to do this, put the damn thing as an alias on the code
> that actually *does* the SW fallback in lib/hweight.c.
> 
> Because the way your patch does it now, it takes "I'm doing stupid
> things" to the next level by turning that function call into *two*
> function calls - first calling __popcountsi2, which then calls
> __sw_hweight32.
> 
> Let's not do stupid things, ok?

I will test declaring __popcount{s,d}i2() as aliases of
__sw_hweight{32,64}() to see what effect that has but I figured that
calling the __arch_hweight variants was more correct because some
architectures (at least RISC-V and x86 when I looked) use alternatives
in that path to use hardware instructions and avoid the software path
altogether. While there would still be the overhead from the function
call, I figured not using the software fallback would at least soften
that blow.

Cheers,
Nathan
Re: Adding __popcountsi2 and __popcountdi2
Posted by Linus Torvalds 9 months, 2 weeks ago
On Thu, 24 Apr 2025 at 19:11, Nathan Chancellor <nathan@kernel.org> wrote:
>
> I will test declaring __popcount{s,d}i2() as aliases of
> __sw_hweight{32,64}() to see what effect that has but I figured that
> calling the __arch_hweight variants was more correct because some
> architectures (at least RISC-V and x86 when I looked) use alternatives
> in that path to use hardware instructions and avoid the software path
> altogether. While there would still be the overhead from the function
> call, I figured not using the software fallback would at least soften
> that blow.

Once you have the overhead of a function call - and all the register
games etc that involves, you're almost certainly better off with the
simple unconditional bitwise games.

                Linus
Re: Adding __popcountsi2 and __popcountdi2
Posted by Maciej W. Rozycki 9 months, 1 week ago
On Thu, 24 Apr 2025, Linus Torvalds wrote:

> > I will test declaring __popcount{s,d}i2() as aliases of
> > __sw_hweight{32,64}() to see what effect that has but I figured that
> > calling the __arch_hweight variants was more correct because some
> > architectures (at least RISC-V and x86 when I looked) use alternatives
> > in that path to use hardware instructions and avoid the software path
> > altogether. While there would still be the overhead from the function
> > call, I figured not using the software fallback would at least soften
> > that blow.
> 
> Once you have the overhead of a function call - and all the register
> games etc that involves, you're almost certainly better off with the
> simple unconditional bitwise games.

 Unless optimising for size, which we do support.

 Adding another function call to a non-leaf function is usually cheap on 
RISC platforms as a stack frame must have been created already, so the 
instructions required have taken their space anyway.  For a leaf function 
-- it depends, but in the cheapest case it's 5 instructions total: 2 pairs 
to adjust SP both ways and to save/restore the link register respectively, 
plus one for the actual call.  The result may still be smaller than an 
open-coded variant.

 I have no opinion though on providing libgcc replacement functions to 
satisfy libcalls for -Os builds.

  Maciej
Re: Adding __popcountsi2 and __popcountdi2
Posted by Linus Torvalds 9 months, 1 week ago
On Mon, 5 May 2025 at 08:05, Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
> On Thu, 24 Apr 2025, Linus Torvalds wrote:
>
> > > I will test declaring __popcount{s,d}i2() as aliases of
> > > __sw_hweight{32,64}() to see what effect that has but I figured that
> > > calling the __arch_hweight variants was more correct because some
> > > architectures (at least RISC-V and x86 when I looked) use alternatives
> > > in that path to use hardware instructions and avoid the software path
> > > altogether. While there would still be the overhead from the function
> > > call, I figured not using the software fallback would at least soften
> > > that blow.
> >
> > Once you have the overhead of a function call - and all the register
> > games etc that involves, you're almost certainly better off with the
> > simple unconditional bitwise games.
>
>  Unless optimising for size, which we do support.

I think you missed the part where I said

 "But if you want to do this, put the damn thing as an alias on the code
  that actually *does* the SW fallback in lib/hweight.c."

IOW, what I object to is being *stupid* and doing *two* function
calls. It sure as hell doesn't optimize for size.

If you want to optimize for size, just call the existing SW fallback
in lib/hweight.c.

Don't call a *new* function that then calls the existing SW fallback
as another fallback level.

Because that just makes me go "Eww".

Honestly, from a performance standpoint absolutely none of this
matters. Unless you *inline* the bit scanning as a single instruction
- which this is *NOT* about - you have already lost.

So at that point, it's a question of just how pointless you want the code to be.

             Linus
Re: Adding __popcountsi2 and __popcountdi2
Posted by Maciej W. Rozycki 9 months, 1 week ago
On Mon, 5 May 2025, Linus Torvalds wrote:

> > > Once you have the overhead of a function call - and all the register
> > > games etc that involves, you're almost certainly better off with the
> > > simple unconditional bitwise games.
> >
> >  Unless optimising for size, which we do support.
> 
> I think you missed the part where I said
> 
>  "But if you want to do this, put the damn thing as an alias on the code
>   that actually *does* the SW fallback in lib/hweight.c."
> 
> IOW, what I object to is being *stupid* and doing *two* function
> calls. It sure as hell doesn't optimize for size.

 Sure, I only commented on whether or not to have entry points defined in 
the link for libcalls that GCC emits whenever it produces a `popcount' 
operation it cannot find a machine description pattern for (IOW a machine 
instruction or a favourable sequence of).

 They need to have specific names a per the subject and it's up to us to 
implement them one way or another should we choose to; and as I say it 
seems not unreasonable to me for GCC to emit these libcalls at least when 
optimising for size.

 Traditionally we have chosen not to implement double-precision division 
libcalls so as to trap expensive usage.  For `popcount' I have no opinion 
offhand.

  Maciej