In the recent discussion [1], it was noticed that the find_next_bit()
functions may be improved by adding wrappers around the common
__find_next_bit().

I tried that trick, and although it doesn't affect performance
significantly, as reported by find_bit_benchmark, there's a ~2.5K
decrease in image size.

find_first_bit() is reworked accordingly.

[1] https://lkml.org/lkml/2022/7/25/830
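For readers outside the thread, the wrapper idea is roughly the following: one
common worker that takes an 'invert' word, with thin wrappers that pass a
constant, so the inliner folds the constant away. This is a userspace sketch
with illustrative names (find_next_common is not the kernel's helper), not the
actual patch:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(unsigned long))

/* Common worker: 'invert' is XORed into each word so the same loop can
 * search for set bits (invert == 0) or clear bits (invert == ~0UL). */
static inline unsigned long find_next_common(const unsigned long *addr,
					     unsigned long size,
					     unsigned long start,
					     unsigned long invert)
{
	unsigned long word;

	if (start >= size)
		return size;

	/* First, possibly partial, word: mask off bits below 'start'. */
	word = (addr[start / BITS_PER_LONG] ^ invert)
	       & (~0UL << (start % BITS_PER_LONG));

	while (!word) {
		start = (start / BITS_PER_LONG + 1) * BITS_PER_LONG;
		if (start >= size)
			return size;
		word = addr[start / BITS_PER_LONG] ^ invert;
	}

	start = start / BITS_PER_LONG * BITS_PER_LONG + __builtin_ctzl(word);
	return start < size ? start : size;
}

static unsigned long find_next_bit(const unsigned long *addr,
				   unsigned long size, unsigned long start)
{
	return find_next_common(addr, size, start, 0UL);
}

static unsigned long find_next_zero_bit(const unsigned long *addr,
					unsigned long size, unsigned long start)
{
	return find_next_common(addr, size, start, ~0UL);
}
```

Because 'invert' is a compile-time constant in each wrapper, the shared body
specializes cleanly and only one copy of the loop logic exists in the source.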
Yury Norov (5):
lib/find_bit: rename le to need_swab in __find_next_bit()
lib/find_bit: optimize find_next_bit() functions
lib/find_bit: unify _find_first_{,and,zero}_bit implementations
lib/find_bit: create find_first_zero_bit_le()
lib/find_bit: re-use __find_first_bit() in __find_next_bit()
include/linux/find.h | 45 +++++++++----
lib/find_bit.c | 155 +++++++++++++++++++++++++++++--------------
2 files changed, 138 insertions(+), 62 deletions(-)
--
2.34.1
On Thu, Jul 28, 2022 at 9:12 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> In the recent discussion [1], it was noticed that find_next_bit()
> functions may be improved by adding wrappers around common
> __find_next_bit().
So looking at the end result, this generates fairly good code.
I say "fairly good" because _find_next_and_bit() ends up being disgusting.
The reason? That

	if (addr2)
		tmp &= addr2[start / BITS_PER_LONG];

generates horrific code when 'addr2' isn't seen to be always NULL.
So this doesn't affect the regular _find_next_bit case, because the
inliner sees that addr2 is always NULL and doesn't generate extra code
for it.
But the code that actually *does* have two incoming bitmasks will now
pointlessly test that second bitmask pointer all the time.
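To make the objection concrete, here is a minimal sketch of that pattern
(names are illustrative, not the kernel's): when 'addr2' is a real pointer
rather than a constant NULL, the compiler must keep the test and the branch
in the per-word path.

```c
#include <stddef.h>

#define BITS_PER_LONG 64

/* Over-generic worker: addr2 is optional, so any caller that genuinely
 * passes two bitmaps pays for a runtime NULL test on every word, while
 * callers passing a literal NULL get the branch folded away by inlining. */
static unsigned long next_word(const unsigned long *addr1,
			       const unsigned long *addr2,
			       unsigned long start)
{
	unsigned long tmp = addr1[start / BITS_PER_LONG];

	if (addr2)	/* the branch in question */
		tmp &= addr2[start / BITS_PER_LONG];
	return tmp;
}
```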
Now, that particular function probably doesn't matter, but this code
seems to be unnecessarily written to try to be overly generic, and
that
(a) makes it hard to read the source code because the source code
doesn't do the obvious thing
(b) clearly generates suboptimal code too
so I'm really not a huge fan of this "try to share code". Not when the
resulting shared code is harder to read, and impossible for the
compiler to do a great job at.
And the code sharing really doesn't help all that much.
If you really want to generate good code, use macros, and share the
logic that way. Not hugely readable either, but I think it's actually
not bad.
I think something like this would work:
#define BIT_FIND_SETUP(addr, size, start)			\
	unsigned long val, idx;					\
	if (unlikely(start >= size))				\
		return size;					\
	idx = start / BITS_PER_LONG;

#define BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
	if (!IS_ALIGNED(start, BITS_PER_LONG)) {		\
		unsigned long mask;				\
		mask = BITMAP_FIRST_WORD_MASK(start);		\
		if ((val = mask & (EXPRESSION)) != 0)		\
			goto found;				\
		idx++;						\
	}

#define BIT_WORD_LOOP(ptr, size, idx, val, EXPRESSION)		\
	do {							\
		if ((val = (EXPRESSION)) != 0)			\
			goto found;				\
		idx++;						\
	} while ((idx)*BITS_PER_LONG < (size));

#define BIT_FIND_BODY(addr, size, start, EXPRESSION)		\
	BIT_FIND_SETUP(addr, size, start)			\
	BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)	\
	BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION)		\
	return size;						\
found:	BIT_WORD_SWAB(val);					\
	return min((idx)*BITS_PER_LONG + __ffs(val), size)

#define BIT_WORD_SWAB(x) /* Nothing */
and then for the BE versions you just use the same macros, but you
make BIT_WORD_SWAB() do the swab.
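A userspace sketch of how the swab hook plugs in (a simplified, assumed
variant of the scheme above, not Linus's attached file: the macro here is a
plain whole-bitmap scan, and FIND_FIRST_BIT, SWAB_NONE, SWAB_WORD and
swab_ul are all illustrative names):

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(unsigned long))

/* Byte-swap helper standing in for the kernel's swab(). */
static unsigned long swab_ul(unsigned long x)
{
	return sizeof(x) == 8 ? (unsigned long)__builtin_bswap64(x)
			      : (unsigned long)__builtin_bswap32((unsigned int)x);
}

/* One shared body; the SWAB hook is the only difference between the
 * native and the byte-swapped ("LE on BE") variants. GNU statement
 * expression, as used freely in kernel-style code. */
#define FIND_FIRST_BIT(addr, size, SWAB)				\
({									\
	unsigned long __idx, __val, __ret = (size);			\
	for (__idx = 0; __idx * BITS_PER_LONG < (size); __idx++) {	\
		__val = (addr)[__idx];					\
		if (__val) {						\
			SWAB(__val);					\
			__ret = __idx * BITS_PER_LONG			\
				+ (unsigned long)__builtin_ctzl(__val);	\
			if (__ret > (size))				\
				__ret = (size);				\
			break;						\
		}							\
	}								\
	__ret;								\
})

#define SWAB_NONE(x)	/* nothing: native word order */
#define SWAB_WORD(x)	((x) = swab_ul(x))

static unsigned long find_first_bit(const unsigned long *addr,
				    unsigned long size)
{
	return FIND_FIRST_BIT(addr, size, SWAB_NONE);
}

static unsigned long find_first_bit_swabbed(const unsigned long *addr,
					    unsigned long size)
{
	return FIND_FIRST_BIT(addr, size, SWAB_WORD);
}
```

The point of the design: both variants are generated from one macro body, so
there is exactly one copy of the loop logic in the source, and the hook costs
nothing in the native build because it expands to nothing.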
I'm attaching an UNTESTED version of lib/find_bit.c that works the
above way (note: it is based on your header file changes!)
It builds for me and seems to generate reasonable code, although I
notice that clang messes up the "__ffs()" inline asm and forces the
source into memory.
(Gcc doesn't do that, but gcc follows the code exactly and generates
"shl $6" instructions, while clang seems to figure out that it can
just add 64 instead)
Linus
On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It builds for me and seems to generate reasonable code, although I
> notice that clang messes up the "__ffs()" inline asm and forces the
> source into memory.
I have created a llvm issue for this at
https://github.com/llvm/llvm-project/issues/56789
and while I noticed this while looking at the rather odd code
generation for the bit finding functions, it seems to be a general
issue with clang inline asm.
It looks like any instruction that takes a mod/rm input (so a register
or memory) will always force the thing to be in memory. Which is very
pointless in itself, but it actually causes some functions to have a
stack frame that they wouldn't otherwise need or want. So it actually
has secondary downsides too.
And yes, that particular case could be solved with __builtin_ctzl(),
which seems to DTRT. But that uses plain bsf, and we seem to really
want tzcnt ("rep bsf") here, although I didn't check why (the comment
explicitly says "Undefined if no bit exists", which is the main
difference between bsf and tzcnt).
I _think_ it's because tzcnt is faster when it exists exactly because
it always writes the destination, so 'bsf' is actually the inferior
op, and clang shouldn't generate it.
But the "rm" thing exists elsewhere too, and I just checked - this
same issue seems to happen with "g" too (ie "any general integer
input").
Linus
On Thu, Jul 28, 2022 at 2:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It builds for me and seems to generate reasonable code, although I
> > notice that clang messes up the "__ffs()" inline asm and forces the
> > source into memory.
>
> I have created a llvm issue for this at
>
> https://github.com/llvm/llvm-project/issues/56789
Thanks for the report. I left a response there (in case you have
notification emails from github filtered; following up here).
https://github.com/llvm/llvm-project/issues/56789#issuecomment-1201525395
So it looks like at least 3 things we can clean up:
1. https://github.com/llvm/llvm-project/issues/20571
2. https://github.com/llvm/llvm-project/issues/34191
3. https://github.com/llvm/llvm-project/issues/33216
>
> and while I noticed this while looking at the rather odd code
> generation for the bit finding functions, it seems to be a general
> issue with clang inline asm.
>
> It looks like any instruction that takes a mod/rm input (so a register
> or memory) will always force the thing to be in memory. Which is very
> pointless in itself, but it actually causes some functions to have a
> stack frame that they wouldn't otherwise need or want. So it actually
> has secondary downsides too.
>
> And yes, that particular case could be solved with __builtin_ctzl(),
> which seems to DTRT. But that uses plain bsf, and we seem to really
> want tzcnt ("rep bsf") here, although I didn't check why (the comment
> explicitly says "Undefined if no bit exists", which is the main
> difference between bsf and tzcnt).
>
> I _think_ it's because tzcnt is faster when it exists exactly because
> it always writes the destination, so 'bsf' is actually the inferior
> op, and clang shouldn't generate it.
>
> But the "rm" thing exists elsewhere too, and I just checked - this
> same issue seems to happen with "g" too (ie "any general integer
> input").
>
> Linus
--
Thanks,
~Nick Desaulniers