From: Syed Nayyar Waris <syednwaris@gmail.com>
The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.
The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
- instead of using roundup(), which adds an unnecessary dependency
on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
- indentation is reduced by not using else-clauses (suggested by
checkpatch for bitmap_get_value())
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
---
v4:
- Address comments by Andy Shevchenko and Yury Norov:
- prevent passing values >= 64 to GENMASK()
- fix commit authorship
- change comments
- check for unlikely(nbits==0)
- drop unnecessary const declarations
- fix kernel-doc comments
- rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
* bitmap_to_arr32(buf, src, nbits) Copy nbits from buf to u32[] dst
* bitmap_to_arr64(buf, src, nbits) Copy nbits from buf to u64[] dst
* bitmap_get_value8(map, start) Get 8bit value from map at start
+ * bitmap_read(map, start, nbits) Read an nbits-sized value from
+ * map at start
* bitmap_set_value8(map, value, start) Set 8bit value to map at start
+ * bitmap_write(map, value, start, nbits) Write an nbits-sized value to
+ * map at start
*
* Note, bitmap_zero() and bitmap_fill() operate over the region of
* unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
return (map[index] >> offset) & 0xFF;
}
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+ unsigned long start,
+ unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+ unsigned long value_low, value_high;
+
+ if (unlikely(!nbits))
+ return 0;
+ if (space >= nbits)
+ return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+ value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+ value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+ return (value_low >> offset) | (value_high << space);
+}
+
/**
* bitmap_set_value8 - set an 8-bit value within a memory region
* @map: address to the bitmap memory region
@@ -599,6 +630,35 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
map[index] |= value << offset;
}
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+ unsigned long value,
+ unsigned long start, unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+
+ if (unlikely(!nbits))
+ return;
+ value &= GENMASK(nbits - 1, 0);
+ if (space >= nbits) {
+ map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+ map[index] |= value << offset;
+ return;
+ }
+ map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+ map[index] |= value << offset;
+ map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+ map[index + 1] |= (value >> space);
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __LINUX_BITMAP_H */
--
2.41.0.487.g6d72f3e995-goog
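For readers without a kernel tree at hand, the semantics of the pair above can be exercised in userspace. The sketch below re-implements the kernel macros (BIT_WORD(), GENMASK(), BITMAP_FIRST_WORD_MASK(), BITMAP_LAST_WORD_MASK()) for a 64-bit long and mirrors the patch; it is an illustration of the proposed behavior, not the kernel code itself.

```c
/*
 * Userspace sketch of the proposed bitmap_read()/bitmap_write() pair.
 * The kernel macros are re-implemented here for a 64-bit long.
 */
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG 64
#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
#define GENMASK(h, l) ((~0UL << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

static unsigned long bitmap_read(const unsigned long *map,
				 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;
	unsigned long value_low, value_high;

	if (!nbits)
		return 0;
	if (space >= nbits)	/* value fits within one word */
		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
	/* value straddles two adjacent words */
	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
	return (value_low >> offset) | (value_high << space);
}

static void bitmap_write(unsigned long *map, unsigned long value,
			 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (!nbits)
		return;
	value &= GENMASK(nbits - 1, 0);		/* trim bits beyond nbits */
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= value >> space;
}
```

A 12-bit value written at bit offset 58 straddles the word boundary and reads back intact, and writing ~0UL with nbits == 8 stores only 0xff, since the value is trimmed before the write.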
On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> + unsigned long value,
> + unsigned long start, unsigned long nbits)
> +{
> + size_t index = BIT_WORD(start);
> + unsigned long offset = start % BITS_PER_LONG;
> + unsigned long space = BITS_PER_LONG - offset;
> +
> + if (unlikely(!nbits))
> + return;
> + value &= GENMASK(nbits - 1, 0);
Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
because otherwise it's an out-of-bounds type of error.
This is kind of a gray zone to me, because it's the caller's responsibility
to provide correct input. But on the other hand, it would be a great
headache debugging corrupted bitmaps.
Now that we've got a single user of the bitmap_write, and even more,
it's wrapped with a helper, I think it would be reasonable to trim a
'value' in the helper, if needed.
Anyways, the comment must warn about that loudly...
> + if (space >= nbits) {
> + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
'GENMASK(nbits - 1, 0) << offset' looks really silly.
> + map[index] |= value << offset;
Here you commit 2 reads and 2 writes for a word instead of one.
> + return;
> + }
> + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
~25 bytes of .text on x86_64.
> + map[index] |= value << offset;
> + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> + map[index + 1] |= (value >> space);
> +}
With all that I think the implementation should look something like
this:
unsigned long w, mask;
if (unlikely(nbits == 0))
return 0;
map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;
w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);
if (end < BITS_PER_LONG)
return;
w = *++map & BITMAP_FIRST_WORD_MASK(end);
*map = w | (value >> BITS_PER_LONG * 2 - end);
It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.
Alexander, can you please try the above and compare?
In previous iteration, I asked you to share disassembly listings for the
functions. Can you please do that now?
Regarding the rest of the series:
- I still see Evgenii's name in mtecomp.c, and EA0 references;
- git-am throws warning about trailing line;
- checkpatch warns 7 times;
Can you fix all the above before sending the new version?
Have you tested generic part against BE32, BE64 and LE32 architectures;
and arch part against BE64? If not, please do.
You're mentioning that the compression ratio is 2 to 20x. Can you
share the absolute numbers? If it's 1k vs 2k, I think most people
just don't care...
Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?
Thanks,
Yury
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked there are none in v5 (only the Suggested-by: tag)

> - git-am throws warning about trailing line;

I checked locally that `git am` does not warn about v5 patches.
But given that the patches are generated with `git format-patch` I
suspect they get garbled when you download them, could it be the case?

> - checkpatch warns 7 times;

It now warns 4 times, three warnings are about updating MAINTAINERS (I
don't think there's need for this), the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

I did now.

> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description, having added the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5

> Thanks,
> Yury

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.
I can easily imagine someone passing -1 (or ~0) as a value, but
wanting to write only n bits of it.
>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > + if (space >= nbits) {
> > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.
As noted in the other patch discussion, pulling offset into GENMASK is
actually not an identity transformation, because nbits + offset may
exceed 64, producing an invalid mask.
>
> > + map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.
There won't be two reads and two writes.
The compiler will read map[index] to a register, apply the mask, then
write value << offset to it, then perform the write.
Unless map[] is a volatile, repeated reads/writes will be optimized
out by any decent compiler.
>
> > + return;
> > + }
> > + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
> ~25 bytes of .text on x86_64.
>
> > + map[index] |= value << offset;
> > + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > + map[index + 1] |= (value >> space);
> > +}
>
> With all that I think the implementation should look something like
> this:
>
> unsigned long w, mask;
>
> if (unlikely(nbits == 0))
> return 0;
>
> map += BIT_WORD(start);
> start %= BITS_PER_LONG;
> end = start + nbits - 1;
>
> w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
> *map = w | (value << start);
>
> if (end < BITS_PER_LONG)
> return;
>
> w = *++map & BITMAP_FIRST_WORD_MASK(end);
> *map = w | (value >> BITS_PER_LONG * 2 - end);
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?
I like the idea of sharing the first write between the branches, and
it can be made even shorter:
===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
===========================================================
According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
GCC (which on the other hand does a poor job optimizing both).
Overall, given that there's currently a single user of these
functions, isn't it premature to optimize them without knowing
anything about their performance?
> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?
Will godbolt work for you (see above)?
>
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;
Will fix, thanks!
> - git-am throws warning about trailing line;
Interesting, I generate the patches using `git format-patch`, I
thought `git am` should be the inversion of it. I'll check.
> - checkpatch warns 7 times;
Indeed, there were warnings that I ignored (e.g. one of them was
caused by me adding extern symbols to the test module, as requested
during the review process). I think I can fix all of them.
>
> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.
BE64 works, I'll also need to check the 32-bit versions as well.
Note that MTE is an ARM64 feature (yet we still need to ensure
bitops.h works on 32 bits).
>
> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...
I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.
> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?
For out-of-line allocations the data can be derived from
/proc/slabinfo, but we don't calculate inline allocations.
Agreed, a debugfs interface won't hurt.
On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > > +/**
> > > + * bitmap_write - write n-bit value within a memory region
> > > + * @map: address to the bitmap memory region
> > > + * @value: value of nbits
> > > + * @start: bit offset of the n-bit value
> > > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > > + */
> > > +static inline void bitmap_write(unsigned long *map,
> > > + unsigned long value,
> > > + unsigned long start, unsigned long nbits)
> > > +{
> > > + size_t index = BIT_WORD(start);
> > > + unsigned long offset = start % BITS_PER_LONG;
> > > + unsigned long space = BITS_PER_LONG - offset;
> > > +
> > > + if (unlikely(!nbits))
> > > + return;
> > > + value &= GENMASK(nbits - 1, 0);
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> > because otherwise it's an out-of-bounds type of error.
>
> I can easily imagine someone passing -1 (or ~0) as a value, but
> wanting to write only n bits of it.
This is an abuse of the new API because we've got a bitmap_set(). But
whatever, let's keep that masking.
...
> I like the idea of sharing the first write between the branches, and
> it can be made even shorter:
>
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> unsigned long start, unsigned long nbits)
> {
> unsigned long offset;
> unsigned long space;
> size_t index;
> bool fit;
>
> if (unlikely(!nbits))
> return;
>
> value &= GENMASK(nbits - 1, 0);
> offset = start % BITS_PER_LONG;
> space = BITS_PER_LONG - offset;
> index = BIT_WORD(start);
> fit = space >= nbits;
space >= nbits <=>
BITS_PER_LONG - offset >= nbits <=>
offset + nbits <= BITS_PER_LONG
> map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
So here GENMASK(nbits + offset - 1, offset) is at max:
GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
point. Does it make sense?
> ~BITMAP_FIRST_WORD_MASK(start));
As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
and vice versa.
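The claimed mask identity is easy to check in userspace with the two kernel macros re-implemented for a 64-bit long. One caveat (my observation, not stated in the thread): the identity only holds when the offset is not a multiple of BITS_PER_LONG, which the cross-word path guarantees, since space < nbits <= BITS_PER_LONG implies a nonzero offset.

```c
/* Userspace re-implementation of the two kernel mask macros (64-bit long). */
#include <assert.h>

#define BITS_PER_LONG 64
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Verify ~FIRST(start) == LAST(start) for every non-word-aligned offset. */
static void check_mask_identity(void)
{
	unsigned long start;

	for (start = 1; start < BITS_PER_LONG; start++)
		assert(~BITMAP_FIRST_WORD_MASK(start) ==
		       BITMAP_LAST_WORD_MASK(start));

	/* At word-aligned offsets the two sides differ: 0 vs. all ones. */
	assert(~BITMAP_FIRST_WORD_MASK(0) == 0UL);
	assert(BITMAP_LAST_WORD_MASK(0) == ~0UL);
}
```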
> map[index] |= value << offset;
> if (fit)
> return;
>
> map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> map[index + 1] |= (value >> space);
> }
> ===========================================================
>
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
> is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
> GCC (which on the other hand does a poor job optimizing both).
>
> Overall, given that there's currently a single user of these
> functions, isn't it premature to optimize them without knowing
> anything about their performance?
>
> > In previous iteration, I asked you to share disassembly listings for the
> > functions. Can you please do that now?
>
> Will godbolt work for you (see above)?
I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.
...
> > You're mentioning that the compression ratio is 2 to 20x. Can you
> > share the absolute numbers? If it's 1k vs 2k, I think most people
> > just don't care...
>
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.
That's impressive. Fruitful idea. It would be important for embedded guys
who may disable MTE because of the memory overhead. I think it's worth
mentioning that in Kconfig, together with the associated performance
overhead, if it's ever measurable.
> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
>
> For out-of-line allocations the data can be derived from
> /proc/slabinfo, but we don't calculate inline allocations.
> Agreed, a debugfs interface won't hurt.
> space >= nbits <=>
> BITS_PER_LONG - offset >= nbits <=>
> offset + nbits <= BITS_PER_LONG
>
> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max:
> GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> point. Does it make sense?
It indeed does. Perhaps pulling offset inside GENMASK is not a bug
after all (a simple test does not show any difference between their
behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.
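The equivalence under discussion can be confirmed exhaustively outside the kernel; the sketch below re-implements GENMASK() for a 64-bit long and checks every in-word (offset, nbits) combination, for illustration only.

```c
/* Userspace re-implementation of GENMASK() for a 64-bit long. */
#include <assert.h>

#define BITS_PER_LONG 64
#define GENMASK(h, l) ((~0UL << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

/*
 * For every in-word (offset, nbits) pair, shifting a zero-based mask is
 * the same as building the mask at the offset directly; the two forms
 * only part ways once nbits + offset exceeds BITS_PER_LONG, where the
 * high index passed to GENMASK() would be invalid.
 */
static void check_genmask_shift(void)
{
	unsigned long offset, nbits;

	for (offset = 0; offset < BITS_PER_LONG; offset++)
		for (nbits = 1; nbits + offset <= BITS_PER_LONG; nbits++)
			assert((GENMASK(nbits - 1, 0) << offset) ==
			       GENMASK(nbits - 1 + offset, offset));
}
```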
>
> > ~BITMAP_FIRST_WORD_MASK(start));
>
> As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
> and vice versa.
Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().
>
> > map[index] |= value << offset;
> > if (fit)
> > return;
> >
> > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
> > map[index + 1] |= (value >> space);
> > }
I'll post the implementations together with the disassembly below.
I used some Clang 17.0.0 version that is a couple months behind
upstream, but that still produces consistently shorter code (~48 bytes
less) than the trunk GCC on Godbolt.
1. Original implementation of bitmap_write() from this patch - 164
bytes (interestingly, it's 157 bytes with Clang 14.0.6)
==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
size_t index = BIT_WORD(start);
unsigned long offset = start % BITS_PER_LONG;
unsigned long space = BITS_PER_LONG - offset;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
if (space >= nbits) {
map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
map[index] |= value << offset;
return;
}
map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
map[index] |= value << offset;
map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
1200: 41 57 push %r15
1202: 41 56 push %r14
1204: 53 push %rbx
1205: 48 85 c9 test %rcx,%rcx
1208: 74 7b je 1285 <bitmap_write+0x85>
120a: 48 89 c8 mov %rcx,%rax
120d: 49 89 d2 mov %rdx,%r10
1210: 49 c1 ea 06 shr $0x6,%r10
1214: 41 89 d1 mov %edx,%r9d
1217: f6 d9 neg %cl
1219: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1220: 49 d3 ef shr %cl,%r15
1223: 41 83 e1 3f and $0x3f,%r9d
1227: 41 b8 40 00 00 00 mov $0x40,%r8d
122d: 4c 21 fe and %r15,%rsi
1230: 48 89 f3 mov %rsi,%rbx
1233: 44 89 c9 mov %r9d,%ecx
1236: 48 d3 e3 shl %cl,%rbx
1239: 4d 29 c8 sub %r9,%r8
123c: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
1243: 4e 8b 34 d7 mov (%rdi,%r10,8),%r14
1247: 49 39 c0 cmp %rax,%r8
124a: 73 3f jae 128b <bitmap_write+0x8b>
124c: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1253: 44 89 c9 mov %r9d,%ecx
1256: 49 d3 e7 shl %cl,%r15
1259: 49 f7 d7 not %r15
125c: 4d 21 fe and %r15,%r14
125f: 49 09 de or %rbx,%r14
1262: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
1266: 01 c2 add %eax,%edx
1268: f6 da neg %dl
126a: 89 d1 mov %edx,%ecx
126c: 49 d3 eb shr %cl,%r11
126f: 49 f7 d3 not %r11
1272: 44 89 c1 mov %r8d,%ecx
1275: 48 d3 ee shr %cl,%rsi
1278: 4e 23 5c d7 08 and 0x8(%rdi,%r10,8),%r11
127d: 4c 09 de or %r11,%rsi
1280: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
1285: 5b pop %rbx
1286: 41 5e pop %r14
1288: 41 5f pop %r15
128a: c3 ret
128b: 44 89 c9 mov %r9d,%ecx
128e: 49 d3 e7 shl %cl,%r15
1291: 49 f7 d7 not %r15
1294: 4d 21 fe and %r15,%r14
1297: 49 09 de or %rbx,%r14
129a: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
129e: 5b pop %rbx
129f: 41 5e pop %r14
12a1: 41 5f pop %r15
12a3: c3 ret
==================================================================
2. Your implementation, my_bitmap_write() - 152 bytes (used to be
slightly worse with Clang 14.0.6 - 159 bytes):
==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long w, end;
if (unlikely(nbits == 0))
return;
value &= GENMASK(nbits - 1, 0);
map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;
w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);
if (end < BITS_PER_LONG)
return;
w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
*map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
1160: 48 85 c9 test %rcx,%rcx
1163: 0f 84 8e 00 00 00 je 11f7 <my_bitmap_write+0x97>
1169: 49 89 c9 mov %rcx,%r9
116c: f6 d9 neg %cl
116e: 48 d3 e6 shl %cl,%rsi
1171: 48 d3 ee shr %cl,%rsi
1174: 49 89 d2 mov %rdx,%r10
1177: 49 c1 ea 06 shr $0x6,%r10
117b: 89 d0 mov %edx,%eax
117d: 83 e0 3f and $0x3f,%eax
1180: 4e 8d 04 08 lea (%rax,%r9,1),%r8
1184: 4a 8d 0c 08 lea (%rax,%r9,1),%rcx
1188: 48 ff c9 dec %rcx
118b: 4e 8b 0c d7 mov (%rdi,%r10,8),%r9
118f: 48 83 f9 3f cmp $0x3f,%rcx
1193: 77 29 ja 11be <my_bitmap_write+0x5e>
1195: 41 f6 d8 neg %r8b
1198: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
119f: 44 89 c1 mov %r8d,%ecx
11a2: 48 d3 ea shr %cl,%rdx
11a5: 89 c1 mov %eax,%ecx
11a7: 48 d3 ea shr %cl,%rdx
11aa: 48 d3 e2 shl %cl,%rdx
11ad: 48 d3 e6 shl %cl,%rsi
11b0: 48 f7 d2 not %rdx
11b3: 49 21 d1 and %rdx,%r9
11b6: 4c 09 ce or %r9,%rsi
11b9: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
11bd: c3 ret
11be: f6 da neg %dl
11c0: 89 d1 mov %edx,%ecx
11c2: 49 d3 e1 shl %cl,%r9
11c5: 49 d3 e9 shr %cl,%r9
11c8: 48 89 f2 mov %rsi,%rdx
11cb: 89 c1 mov %eax,%ecx
11cd: 48 d3 e2 shl %cl,%rdx
11d0: 4c 09 ca or %r9,%rdx
11d3: 4a 89 14 d7 mov %rdx,(%rdi,%r10,8)
11d7: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
11dc: 41 f6 d8 neg %r8b
11df: 44 89 c1 mov %r8d,%ecx
11e2: 48 d3 e2 shl %cl,%rdx
11e5: 48 d3 ea shr %cl,%rdx
11e8: f6 d8 neg %al
11ea: 89 c1 mov %eax,%ecx
11ec: 48 d3 ee shr %cl,%rsi
11ef: 48 09 d6 or %rdx,%rsi
11f2: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
11f7: c3 ret
==================================================================
3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:
==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
12b0: 48 85 c9 test %rcx,%rcx
12b3: 74 6e je 1323 <bitmap_write_new+0x73>
12b5: 48 89 c8 mov %rcx,%rax
12b8: f6 d9 neg %cl
12ba: 49 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%r10
12c1: 49 d3 ea shr %cl,%r10
12c4: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
12cb: 4c 21 d6 and %r10,%rsi
12ce: 89 d1 mov %edx,%ecx
12d0: 83 e1 3f and $0x3f,%ecx
12d3: 41 b8 40 00 00 00 mov $0x40,%r8d
12d9: 49 29 c8 sub %rcx,%r8
12dc: 49 89 d1 mov %rdx,%r9
12df: 49 c1 e9 06 shr $0x6,%r9
12e3: 49 39 c0 cmp %rax,%r8
12e6: 4d 0f 42 d3 cmovb %r11,%r10
12ea: 49 d3 e2 shl %cl,%r10
12ed: 49 f7 d2 not %r10
12f0: 4e 23 14 cf and (%rdi,%r9,8),%r10
12f4: 49 89 f3 mov %rsi,%r11
12f7: 49 d3 e3 shl %cl,%r11
12fa: 4d 09 d3 or %r10,%r11
12fd: 4e 89 1c cf mov %r11,(%rdi,%r9,8)
1301: 49 39 c0 cmp %rax,%r8
1304: 73 1d jae 1323 <bitmap_write_new+0x73>
1306: 01 d0 add %edx,%eax
1308: 4a 8b 54 cf 08 mov 0x8(%rdi,%r9,8),%rdx
130d: 89 c1 mov %eax,%ecx
130f: 48 d3 ea shr %cl,%rdx
1312: 48 d3 e2 shl %cl,%rdx
1315: 44 89 c1 mov %r8d,%ecx
1318: 48 d3 ee shr %cl,%rsi
131b: 48 09 d6 or %rdx,%rsi
131e: 4a 89 74 cf 08 mov %rsi,0x8(%rdi,%r9,8)
1323: c3 ret
==================================================================
4. The version of bitmap_write_new() that uses `(GENMASK(nbits - 1 +
offset, offset)` - 146 bytes:
==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
1330: 48 85 c9 test %rcx,%rcx
1333: 74 6a je 139f <bitmap_write_new_shift+0x6f>
1335: 48 89 c8 mov %rcx,%rax
1338: f6 d9 neg %cl
133a: 48 d3 e6 shl %cl,%rsi
133d: 48 d3 ee shr %cl,%rsi
1340: 41 89 d0 mov %edx,%r8d
1343: 41 83 e0 3f and $0x3f,%r8d
1347: 41 b9 40 00 00 00 mov $0x40,%r9d
134d: 4d 29 c1 sub %r8,%r9
1350: 49 89 d2 mov %rdx,%r10
1353: 49 c1 ea 06 shr $0x6,%r10
1357: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
135e: 44 89 c1 mov %r8d,%ecx
1361: 49 d3 e3 shl %cl,%r11
1364: 49 39 c1 cmp %rax,%r9
1367: 73 37 jae 13a0 <bitmap_write_new_shift+0x70>
1369: 53 push %rbx
136a: 49 f7 d3 not %r11
136d: 4e 23 1c d7 and (%rdi,%r10,8),%r11
1371: 48 89 f3 mov %rsi,%rbx
1374: 44 89 c1 mov %r8d,%ecx
1377: 48 d3 e3 shl %cl,%rbx
137a: 4c 09 db or %r11,%rbx
137d: 4a 89 1c d7 mov %rbx,(%rdi,%r10,8)
1381: 01 d0 add %edx,%eax
1383: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
1388: 89 c1 mov %eax,%ecx
138a: 48 d3 ea shr %cl,%rdx
138d: 48 d3 e2 shl %cl,%rdx
1390: 44 89 c9 mov %r9d,%ecx
1393: 48 d3 ee shr %cl,%rsi
1396: 48 09 d6 or %rdx,%rsi
1399: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
139e: 5b pop %rbx
139f: c3 ret
13a0: 44 01 c0 add %r8d,%eax
13a3: f6 d8 neg %al
13a5: 89 c1 mov %eax,%ecx
13a7: 49 d3 e3 shl %cl,%r11
13aa: 49 d3 eb shr %cl,%r11
13ad: 49 f7 d3 not %r11
13b0: 44 89 c1 mov %r8d,%ecx
13b3: 48 d3 e6 shl %cl,%rsi
13b6: 4e 23 1c d7 and (%rdi,%r10,8),%r11
13ba: 4c 09 de or %r11,%rsi
13bd: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
13c1: c3 ret
==================================================================
For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.
PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this letter until then.
On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > space >= nbits <=>
> > BITS_PER_LONG - offset >= nbits <=>
> > offset + nbits <= BITS_PER_LONG
> >
> > > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max:
> > GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> > point. Does it make sense?
>
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug
> after all (a simple test does not show any difference between their
> behavior.
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.
OK. Can you put a comment explaining this? Or maybe would be even
better to use BITMAP_LAST_WORD_MASK() here:
mask = BITMAP_LAST_WORD_MASK(nbits);
value &= mask;
...
map[index] &= (fit ? (~mask << offset)) :
> > > ~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
> > and vice versa.
>
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().
Wow... If that's consistent across different compilers/arches, we'd
just drop the latter. Thanks for pointing that. I'll check.
> > > map[index] |= value << offset;
> > > if (fit)
> > > return;
> > >
> > > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
>
> > > map[index + 1] |= (value >> space);
> > > }
>
> I'll post the implementations together with the disassembly below.
> I used some Clang 17.0.0 version that is a couple months behind
> upstream, but that still produces consistently shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
>
> 1. Original implementation of bitmap_write() from this patch - 164
> bytes (interestingly, it's 157 bytes with Clang 14.0.6)
I spotted that too in some other case. Newer compilers tend to
generate bigger code, but the result usually works faster. One
particular reason in my case was loop unrolling.
[...]
> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:
30% better in size - that's impressive!
> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> 		      unsigned long start, unsigned long nbits)
> {
> 	unsigned long offset;
> 	unsigned long space;
> 	size_t index;
> 	bool fit;
>
> 	if (unlikely(!nbits))
> 		return;
>
> 	value &= GENMASK(nbits - 1, 0);
> 	offset = start % BITS_PER_LONG;
> 	space = BITS_PER_LONG - offset;
> 	index = BIT_WORD(start);
> 	fit = space >= nbits;
>
> 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> 		       ~BITMAP_FIRST_WORD_MASK(start));
> 	map[index] |= value << offset;
> 	if (fit)
> 		return;
>
> 	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
> 	map[index + 1] |= (value >> space);
> }
Thanks,
Yury
> OK. Can you put a comment explaining this? Or maybe would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
> 	mask = BITMAP_LAST_WORD_MASK(nbits);
> 	value &= mask;
> 	...
> 	map[index] &= (fit ? (~mask << offset)) :

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?
On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > > > ~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > > and vice versa.
> >
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>
> Wow... If that's consistent across different compilers/arches, we'd
> just drop the latter. Thanks for pointing that. I'll check.

It might be...

...

> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > 		~BITMAP_FIRST_WORD_MASK(start));

...that both parts of the ternary use a negation, so the compiler may
apply the negation only once. (Haven't looked at the assembly.)

--
With Best Regards,
Andy Shevchenko
On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > +	map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

But you followed the thread to get a clue why it's written in this form, right?

...

> With all that I think the implementation should look something like
> this:

I would go this way if and only if the code generation on main architectures
with both GCC and clang is better.

And maybe even some performance tests need to be provided.

...

> Alexander, can you please try the above and compare?
> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Exactly what we need before going with your suggestion(s).

--
With Best Regards,
Andy Shevchenko
On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
>
> But you followed the thread to get a clue why it's written in this form, right?
Yes, I did. But I don't expect everyone looking at kernel code would spend
time recovering discussions that explain why that happened. So, at least it
would be fine to drop a comment.
> ...
>
> > With all that I think the implementation should look something like
> > this:
>
> I would go this way if and only if the code generation on main architectures
> with both GCC and clang is better.
>
> And maybe even some performance tests need to be provided.
For the following implementation:
void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}
This is the bloat-o-meter output:
$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function old new delta
test_bitmap_init 3846 5484 +1638
test_bitmap_write_perf - 401 +401
bitmap_write - 271 +271
my_bitmap_write - 248 +248
bitmap_read - 229 +229
__pfx_test_bitmap_write_perf - 16 +16
__pfx_my_bitmap_write - 16 +16
__pfx_bitmap_write - 16 +16
__pfx_bitmap_read - 16 +16
Total: Before=36964, After=39815, chg +7.71%
And this is the performance test:
	for (cnt = 0; cnt < 5; cnt++) {
		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_err("bitmap_write:\t%llu\t", time);

		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				my_bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_cont("%llu\n", time);
	}
Which on x86_64/kvm with GCC gives:
Orig My
[ 1.630731] test_bitmap: bitmap_write: 299092 252764
[ 1.631584] test_bitmap: bitmap_write: 299522 252554
[ 1.632429] test_bitmap: bitmap_write: 299171 258665
[ 1.633280] test_bitmap: bitmap_write: 299241 252794
[ 1.634133] test_bitmap: bitmap_write: 306716 252934
So, it's a ~15% difference in performance and 8% in size.

I don't insist on my implementation, but I think we should experiment more
with code generation.
Thanks,
Yury
On Mon, Jul 24, 2023 at 10:04:34PM -0700, Yury Norov wrote:
> On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> > On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

...

> > > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> >
> > But you followed the thread to get a clue why it's written in this form, right?
>
> Yes, I did. But I don't expect everyone looking at kernel code would spend
> time recovering discussions that explain why that happened. So, at least it
> would be fine to drop a comment.

See also below.

...

> 	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));

This GENMASK() may generate worse code, as the compiler uses more
instructions than the simpler approach above.

...

> bitmap_write - 271 +271
> my_bitmap_write - 248 +248
> bitmap_read - 229 +229

my_ -- means your proposal? Do you mean you have it better in size than the
original one?

--
With Best Regards,
Andy Shevchenko
On Sat, Jul 22, 2023 at 06:57:26PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.
>
> This is kind of a gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
OK. I spent a night with that, and I'm still not sure. Pseudo code
that implements it looks like this:
for (bit = 0; bit < nbits; bit++)
assign_bit(start + bit, bitmap, val & BIT(bit));
And it ignores trailing bits. So if we're optimizing this pattern,
we'd ignore these bits just as well...
Either way, whatever we decide, let's be clear about our intentions
and mention explicitly that tail bits either must be zero, or are
ignored.
Alexander, can you add the snippet above to the comments for the
bitmap_write() and bitmap_read(), as well as in the test? Also, if we
decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
generate better code than GENMASK(nbits - 1, 0) does?
Comments are very appreciated.
Thanks,
Yury
> Either way, whatever we decide, let's stay clear with our intentions
> and mention explicitly that tail bits are either must be zero, or
> ignored.
>
> Alexander, can you add the snippet above to the comments for the
> bitmap_write() and bitmap_read(), as well as in the test? Also, if we
> decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
> generate better code than GENMASK(nbits - 1, 0) does?

Added the snippet above to bitmap_write(). I think, however, it is obvious
that bitmap_read() reads only nbits?

> Commets are very appreciated.
>
> Thanks,
> Yury

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg