In the recent discussion [1], it was noticed that the find_next_bit()
functions may be improved by adding wrappers around the common
__find_next_bit().

I tried that trick, and although it doesn't affect performance
significantly, as reported by find_bit_benchmark, there's a ~2.5K
decrease in image size.

find_first_bit() is reworked accordingly.

[1] https://lkml.org/lkml/2022/7/25/830
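For readers outside the thread, the wrapper idea is roughly the following: one
common worker that takes an 'invert' word, with thin wrappers that pass a
constant, so the inliner folds the constant away. This is a userspace sketch
with illustrative names (find_next_common is not the kernel's helper), not the
actual patch:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(unsigned long))

/* Common worker: 'invert' is XORed into each word so the same loop can
 * search for set bits (invert == 0) or clear bits (invert == ~0UL). */
static inline unsigned long find_next_common(const unsigned long *addr,
					     unsigned long size,
					     unsigned long start,
					     unsigned long invert)
{
	unsigned long word;

	if (start >= size)
		return size;

	/* First, possibly partial, word: mask off bits below 'start'. */
	word = (addr[start / BITS_PER_LONG] ^ invert)
	       & (~0UL << (start % BITS_PER_LONG));

	while (!word) {
		start = (start / BITS_PER_LONG + 1) * BITS_PER_LONG;
		if (start >= size)
			return size;
		word = addr[start / BITS_PER_LONG] ^ invert;
	}

	start = start / BITS_PER_LONG * BITS_PER_LONG + __builtin_ctzl(word);
	return start < size ? start : size;
}

static unsigned long find_next_bit(const unsigned long *addr,
				   unsigned long size, unsigned long start)
{
	return find_next_common(addr, size, start, 0UL);
}

static unsigned long find_next_zero_bit(const unsigned long *addr,
					unsigned long size, unsigned long start)
{
	return find_next_common(addr, size, start, ~0UL);
}
```

Because 'invert' is a compile-time constant in each wrapper, the shared body
specializes cleanly and only one copy of the loop logic exists in the source.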
Yury Norov (5):
lib/find_bit: rename le to need_swab in __find_next_bit()
lib/find_bit: optimize find_next_bit() functions
lib/find_bit: unify _find_first_{,and,zero}_bit implementations
lib/find_bit: create find_first_zero_bit_le()
lib/find_bit: re-use __find_first_bit() in __find_next_bit()
include/linux/find.h | 45 +++++++++----
lib/find_bit.c | 155 +++++++++++++++++++++++++++++--------------
2 files changed, 138 insertions(+), 62 deletions(-)
--
2.34.1
On Thu, Jul 28, 2022 at 9:12 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> In the recent discussion [1], it was noticed that find_next_bit()
> functions may be improved by adding wrappers around common
> __find_next_bit().
So looking at the end result, this generates fairly good code.
I say "fairly good" because _find_next_and_bit() ends up being disgusting.
The reason? That

	if (addr2)
		tmp &= addr2[start / BITS_PER_LONG];

generates horrific code when 'addr2' isn't seen to be always NULL.
So this doesn't affect the regular _find_next_bit case, because the
inliner sees that addr2 is always NULL and doesn't generate extra code
for it.
But the code that actually *does* have two incoming bitmasks will now
pointlessly test that second bitmask pointer all the time.
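To make the objection concrete, here is a minimal sketch of that pattern
(names are illustrative, not the kernel's): when 'addr2' is a real pointer
rather than a constant NULL, the compiler must keep the test and the branch
in the per-word path.

```c
#include <stddef.h>

#define BITS_PER_LONG 64

/* Over-generic worker: addr2 is optional, so any caller that genuinely
 * passes two bitmaps pays for a runtime NULL test on every word, while
 * callers passing a literal NULL get the branch folded away by inlining. */
static unsigned long next_word(const unsigned long *addr1,
			       const unsigned long *addr2,
			       unsigned long start)
{
	unsigned long tmp = addr1[start / BITS_PER_LONG];

	if (addr2)	/* the branch in question */
		tmp &= addr2[start / BITS_PER_LONG];
	return tmp;
}
```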
Now, that particular function probably doesn't matter, but this code
seems to be unnecessarily written to try to be overly generic, and
that
(a) makes it hard to read the source code because the source code
doesn't do the obvious thing
(b) clearly generates suboptimal code too
so I'm really not a huge fan of this "try to share code". Not when the
resulting shared code is harder to read, and impossible for the
compiler to do a great job at.
And the code sharing really doesn't help all that much.
If you really want to generate good code, use macros, and share the
logic that way. Not hugely readable either, but I think it's actually
not bad.
I think something like this would work:
#define BIT_FIND_SETUP(addr, size, start)			\
	unsigned long val, idx;					\
	if (unlikely(start >= size))				\
		return size;					\
	idx = start / BITS_PER_LONG;

#define BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
	if (!IS_ALIGNED(start, BITS_PER_LONG)) {		\
		unsigned long mask;				\
		mask = BITMAP_FIRST_WORD_MASK(start);		\
		if ((val = mask & (EXPRESSION)) != 0)		\
			goto found;				\
		idx++;						\
	}

#define BIT_WORD_LOOP(ptr, size, idx, val, EXPRESSION)		\
	do {							\
		if ((val = (EXPRESSION)) != 0)			\
			goto found;				\
		idx++;						\
	} while ((idx)*BITS_PER_LONG < (size));

#define BIT_FIND_BODY(addr, size, start, EXPRESSION)		\
	BIT_FIND_SETUP(addr, size, start)			\
	BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
	BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)		\
	return size;						\
found:	BIT_WORD_SWAB(val);					\
	return min((idx)*BITS_PER_LONG + __ffs(val), size)

#define BIT_WORD_SWAB(x) /* Nothing */
and then for the BE versions you just use the same macros, but you
make BIT_WORD_SWAB() do the swab.
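A userspace sketch of how the swab hook plugs in (a simplified, assumed
variant of the scheme above, not Linus's attached file: the macro here is a
plain whole-bitmap scan, and FIND_FIRST_BIT, SWAB_NONE, SWAB_WORD and
swab_ul are all illustrative names):

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(unsigned long))

/* Byte-swap helper standing in for the kernel's swab(). */
static unsigned long swab_ul(unsigned long x)
{
	return sizeof(x) == 8 ? (unsigned long)__builtin_bswap64(x)
			      : (unsigned long)__builtin_bswap32((unsigned int)x);
}

/* One shared body; the SWAB hook is the only difference between the
 * native and the byte-swapped ("LE on BE") variants. GNU statement
 * expression, as used freely in kernel-style code. */
#define FIND_FIRST_BIT(addr, size, SWAB)				\
({									\
	unsigned long __idx, __val, __ret = (size);			\
	for (__idx = 0; __idx * BITS_PER_LONG < (size); __idx++) {	\
		__val = (addr)[__idx];					\
		if (__val) {						\
			SWAB(__val);					\
			__ret = __idx * BITS_PER_LONG			\
				+ (unsigned long)__builtin_ctzl(__val);	\
			if (__ret > (size))				\
				__ret = (size);				\
			break;						\
		}							\
	}								\
	__ret;								\
})

#define SWAB_NONE(x)	/* nothing: native word order */
#define SWAB_WORD(x)	((x) = swab_ul(x))

static unsigned long find_first_bit(const unsigned long *addr,
				    unsigned long size)
{
	return FIND_FIRST_BIT(addr, size, SWAB_NONE);
}

static unsigned long find_first_bit_swabbed(const unsigned long *addr,
					    unsigned long size)
{
	return FIND_FIRST_BIT(addr, size, SWAB_WORD);
}
```

The point of the design: both variants are generated from one macro body, so
there is exactly one copy of the loop logic in the source, and the hook costs
nothing in the native build because it expands to nothing.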
I'm attaching an UNTESTED version of lib/find_bit.c that works the
above way (note: it is based on your header file changes!)
It builds for me and seems to generate reasonable code, although I
notice that clang messes up the "__ffs()" inline asm and forces the
source into memory.
(Gcc doesn't do that, but gcc follows the code exactly and generates
"shl $6" instructions, while clang seems to figure out that it can
just add 64 instead)
Linus
On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It builds for me and seems to generate reasonable code, although I
> notice that clang messes up the "__ffs()" inline asm and forces the
> source into memory.
I have created a llvm issue for this at
https://github.com/llvm/llvm-project/issues/56789
and while I noticed this while looking at the rather odd code
generation for the bit finding functions, it seems to be a general
issue with clang inline asm.
It looks like any instruction that takes a mod/rm input (so a register
or memory) will always force the thing to be in memory. Which is very
pointless in itself, but it actually causes some functions to have a
stack frame that they wouldn't otherwise need or want. So it actually
has secondary downsides too.
And yes, that particular case could be solved with __builtin_ctzl(),
which seems to DTRT. But that uses plain bsf, and we seem to really
want tzcnt ("rep bsf") here, although I didn't check why (the comment
explicitly says "Undefined if no bit exists", which is the main
difference between bsf and tzcnt).
I _think_ it's because tzcnt is faster when it exists exactly because
it always writes the destination, so 'bsf' is actually the inferior
op, and clang shouldn't generate it.
But the "rm" thing exists elsewhere too, and I just checked - this
same issue seems to happen with "g" too (ie "any general integer
input").
Linus
On Thu, Jul 28, 2022 at 2:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It builds for me and seems to generate reasonable code, although I
> > notice that clang messes up the "__ffs()" inline asm and forces the
> > source into memory.
>
> I have created a llvm issue for this at
>
> https://github.com/llvm/llvm-project/issues/56789
Thanks for the report. I left a response there (in case you have
notification emails from github filtered; following up here).
https://github.com/llvm/llvm-project/issues/56789#issuecomment-1201525395
So it looks like at least 3 things we can clean up:
1. https://github.com/llvm/llvm-project/issues/20571
2. https://github.com/llvm/llvm-project/issues/34191
3. https://github.com/llvm/llvm-project/issues/33216
>
> and while I noticed this while looking at the rather odd code
> generation for the bit finding functions, it seems to be a general
> issue with clang inline asm.
>
> It looks like any instruction that takes a mod/rm input (so a register
> or memory) will always force the thing to be in memory. Which is very
> pointless in itself, but it actually causes some functions to have a
> stack frame that they wouldn't otherwise need or want. So it actually
> has secondary downsides too.
>
> And yes, that particular case could be solved with __builtin_ctzl(),
> which seems to DTRT. But that uses plain bsf, and we seem to really
> want tzcnt ("rep bsf") here, although I didn't check why (the comment
> explicitly says "Undefined if no bit exists", which is the main
> difference between bsf and tzcnt).
>
> I _think_ it's because tzcnt is faster when it exists exactly because
> it always writes the destination, so 'bsf' is actually the inferior
> op, and clang shouldn't generate it.
>
> But the "rm" thing exists elsewhere too, and I just checked - this
> same issue seems to happen with "g" too (ie "any general integer
> input").
>
> Linus
--
Thanks,
~Nick Desaulniers