From: Syed Nayyar Waris <syednwaris@gmail.com>
The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at an arbitrary position in the bitmap.
The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:
- instead of using roundup(), which adds an unnecessary dependency
on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
- indentation is reduced by not using else-clauses (suggested by
checkpatch for bitmap_get_value())
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
---
v4:
- Address comments by Andy Shevchenko and Yury Norov:
- prevent passing values >= 64 to GENMASK()
- fix commit authorship
- change comments
- check for unlikely(nbits==0)
- drop unnecessary const declarations
- fix kernel-doc comments
- rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
* bitmap_to_arr32(buf, src, nbits) Copy nbits from buf to u32[] dst
* bitmap_to_arr64(buf, src, nbits) Copy nbits from buf to u64[] dst
* bitmap_get_value8(map, start) Get 8bit value from map at start
+ * bitmap_read(map, start, nbits) Read an nbits-sized value from
+ * map at start
* bitmap_set_value8(map, value, start) Set 8bit value to map at start
+ * bitmap_write(map, value, start, nbits) Write an nbits-sized value to
+ * map at start
*
* Note, bitmap_zero() and bitmap_fill() operate over the region of
* unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
return (map[index] >> offset) & 0xFF;
}
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+ unsigned long start,
+ unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+ unsigned long value_low, value_high;
+
+ if (unlikely(!nbits))
+ return 0;
+ if (space >= nbits)
+ return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+ value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+ value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+ return (value_low >> offset) | (value_high << space);
+}
+
/**
* bitmap_set_value8 - set an 8-bit value within a memory region
* @map: address to the bitmap memory region
@@ -599,6 +630,35 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
map[index] |= value << offset;
}
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+ unsigned long value,
+ unsigned long start, unsigned long nbits)
+{
+ size_t index = BIT_WORD(start);
+ unsigned long offset = start % BITS_PER_LONG;
+ unsigned long space = BITS_PER_LONG - offset;
+
+ if (unlikely(!nbits))
+ return;
+ value &= GENMASK(nbits - 1, 0);
+ if (space >= nbits) {
+ map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+ map[index] |= value << offset;
+ return;
+ }
+ map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+ map[index] |= value << offset;
+ map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+ map[index + 1] |= (value >> space);
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __LINUX_BITMAP_H */
--
2.41.0.487.g6d72f3e995-goog
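For readers without a kernel tree at hand, the semantics of the pair above can be exercised in userspace. The sketch below re-implements the kernel macros (BIT_WORD(), GENMASK(), BITMAP_FIRST_WORD_MASK(), BITMAP_LAST_WORD_MASK()) for a 64-bit long and mirrors the patch; it is an illustration of the proposed behavior, not the kernel code itself.

```c
/*
 * Userspace sketch of the proposed bitmap_read()/bitmap_write() pair.
 * The kernel macros are re-implemented here for a 64-bit long.
 */
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG 64
#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
#define GENMASK(h, l) ((~0UL << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

static unsigned long bitmap_read(const unsigned long *map,
				 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;
	unsigned long value_low, value_high;

	if (!nbits)
		return 0;
	if (space >= nbits)	/* value fits within one word */
		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
	/* value straddles two adjacent words */
	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
	return (value_low >> offset) | (value_high << space);
}

static void bitmap_write(unsigned long *map, unsigned long value,
			 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (!nbits)
		return;
	value &= GENMASK(nbits - 1, 0);		/* trim bits beyond nbits */
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= value >> space;
}
```

A 12-bit value written at bit offset 58 straddles the word boundary and reads back intact, and writing ~0UL with nbits == 8 stores only 0xff, since the value is trimmed before the write.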
On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> + unsigned long value,
> + unsigned long start, unsigned long nbits)
> +{
> + size_t index = BIT_WORD(start);
> + unsigned long offset = start % BITS_PER_LONG;
> + unsigned long space = BITS_PER_LONG - offset;
> +
> + if (unlikely(!nbits))
> + return;
> + value &= GENMASK(nbits - 1, 0);
Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
because otherwise it's an out-of-bounds type of error.
This is kind of a gray zone to me, because it's the caller's responsibility
to provide correct input. But on the other hand, it would be a great
headache debugging corrupted bitmaps.
Now that we've got a single user of the bitmap_write, and even more,
it's wrapped with a helper, I think it would be reasonable to trim a
'value' in the helper, if needed.
Anyways, the comment must warn about that loudly...
> + if (space >= nbits) {
> + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
'GENMASK(nbits - 1, 0) << offset' looks really silly.
> + map[index] |= value << offset;
Here you commit 2 reads and 2 writes for a word instead of one.
> + return;
> + }
> + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
~25 bytes of .text on x86_64.
> + map[index] |= value << offset;
> + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> + map[index + 1] |= (value >> space);
> +}
With all that I think the implementation should look something like
this:
unsigned long w, mask;
if (unlikely(nbits == 0))
return 0;
map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;
w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);
if (end < BITS_PER_LONG)
return;
w = *++map & BITMAP_FIRST_WORD_MASK(end);
*map = w | (value >> BITS_PER_LONG * 2 - end);
It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.
Alexander, can you please try the above and compare?
In previous iteration, I asked you to share disassembly listings for the
functions. Can you please do that now?
Regarding the rest of the series:
- I still see Evgenii's name in mtecomp.c, and EA0 references;
- git-am throws warning about trailing line;
- checkpatch warns 7 times;
Can you fix all the above before sending the new version?
Have you tested generic part against BE32, BE64 and LE32 architectures;
and arch part against BE64? If not, please do.
You're mentioning that the compression ratio is 2 to 20x. Can you
share the absolute numbers? If it's 1k vs 2k, I think most people
just don't care...
Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?
Thanks,
Yury
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked there are none in v5 (only the Suggested-by: tag)

> - git-am throws warning about trailing line;

I checked locally that `git am` does not warn about v5 patches.
But given that the patches are generated with `git format-patch` I
suspect they get garbled when you download them, could it be the case?

> - checkpatch warns 7 times;

It now warns 4 times, three warnings are about updating MAINTAINERS (I
don't think there's need for this), the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.

I did now.

> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description, having added the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5

> Thanks,
> Yury

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.
I can easily imagine someone passing -1 (or ~0) as a value, but
wanting to write only n bits of it.
>
> This is kind of gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > + if (space >= nbits) {
> > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.
As noted in the other patch discussion, pulling offset into GENMASK is
actually not an identity transformation, because nbits + offset may
exceed 64, producing an invalid mask.
>
> > + map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.
There won't be two reads and two writes.
The compiler will read map[index] to a register, apply the mask, then
write value << offset to it, then perform the write.
Unless map[] is a volatile, repeated reads/writes will be optimized
out by any decent compiler.
>
> > + return;
> > + }
> > + map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST WORD. I tried to replace, and it saves
> ~25 bytes of .text on x86_64.
>
> > + map[index] |= value << offset;
> > + map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > + map[index + 1] |= (value >> space);
> > +}
>
> With all that I think the implementation should look something like
> this:
>
> unsigned long w, mask;
>
> if (unlikely(nbits == 0))
> return 0;
>
> map += BIT_WORD(start);
> start %= BITS_PER_LONG;
> end = start + nbits - 1;
>
> w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
> *map = w | (value << start);
>
> if (end < BITS_PER_LONG)
> return;
>
> w = *++map & BITMAP_FIRST_WORD_MASK(end);
> *map = w | (value >> BITS_PER_LONG * 2 - end);
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?
I like the idea of sharing the first write between the branches, and
it can be made even shorter:
===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
===========================================================
According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
GCC (which on the other hand does a poor job optimizing both).
Overall, given that there's currently a single user of these
functions, isn't it premature to optimize them without knowing
anything about their performance?
> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?
Will godbolt work for you (see above)?
>
> Regarding the rest of the series:
> - I still see Evgenii's name in mtecomp.c, and EA0 references;
Will fix, thanks!
> - git-am throws warning about trailing line;
Interesting, I generate the patches using `git format-patch`, I
thought `git am` should be the inversion of it. I'll check.
> - checkpatch warns 7 times;
Indeed, there were warnings that I ignored (e.g. one of them was
caused by me adding extern symbols to the test module, as requested
during the review process). I think I can fix all of them.
>
> Can you fix all the above before sending the new version?
>
> Have you tested generic part against BE32, BE64 and LE32 architectures;
> and arch part against BE64? If not, please do.
BE64 works, I'll also need to check the 32-bit versions as well.
Note that MTE is an ARM64 feature (yet we still need to ensure
bitops.h works on 32 bits).
>
> You're mentioning that the compression ratio is 2 to 20x. Can you
> share the absolute numbers? If it's 1k vs 2k, I think most people
> just don't care...
I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.
> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?
For out-of-line allocations the data can be derived from
/proc/slabinfo, but we don't calculate inline allocations.
Agreed, a debugfs interface won't hurt.
On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > > +/**
> > > + * bitmap_write - write n-bit value within a memory region
> > > + * @map: address to the bitmap memory region
> > > + * @value: value of nbits
> > > + * @start: bit offset of the n-bit value
> > > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > > + */
> > > +static inline void bitmap_write(unsigned long *map,
> > > + unsigned long value,
> > > + unsigned long start, unsigned long nbits)
> > > +{
> > > + size_t index = BIT_WORD(start);
> > > + unsigned long offset = start % BITS_PER_LONG;
> > > + unsigned long space = BITS_PER_LONG - offset;
> > > +
> > > + if (unlikely(!nbits))
> > > + return;
> > > + value &= GENMASK(nbits - 1, 0);
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> > because otherwise it's an out-of-bounds type of error.
>
> I can easily imagine someone passing -1 (or ~0) as a value, but
> wanting to write only n bits of it.
This is an abuse of the new API because we've got a bitmap_set(). But
whatever, let's keep that masking.
...
> I like the idea of sharing the first write between the branches, and
> it can be made even shorter:
>
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> unsigned long start, unsigned long nbits)
> {
> unsigned long offset;
> unsigned long space;
> size_t index;
> bool fit;
>
> if (unlikely(!nbits))
> return;
>
> value &= GENMASK(nbits - 1, 0);
> offset = start % BITS_PER_LONG;
> space = BITS_PER_LONG - offset;
> index = BIT_WORD(start);
> fit = space >= nbits;
space >= nbits <=>
BITS_PER_LONG - offset >= nbits <=>
offset + nbits <= BITS_PER_LONG
> map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
So here GENMASK(nbits + offset - 1, offset) is at max:
GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
point. Does it make sense?
> ~BITMAP_FIRST_WORD_MASK(start));
As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
and vice versa.
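The claimed mask identity is easy to check in userspace with the two kernel macros re-implemented for a 64-bit long. One caveat (my observation, not stated in the thread): the identity only holds when the offset is not a multiple of BITS_PER_LONG, which the cross-word path guarantees, since space < nbits <= BITS_PER_LONG implies a nonzero offset.

```c
/* Userspace re-implementation of the two kernel mask macros (64-bit long). */
#include <assert.h>

#define BITS_PER_LONG 64
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits)  (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Verify ~FIRST(start) == LAST(start) for every non-word-aligned offset. */
static void check_mask_identity(void)
{
	unsigned long start;

	for (start = 1; start < BITS_PER_LONG; start++)
		assert(~BITMAP_FIRST_WORD_MASK(start) ==
		       BITMAP_LAST_WORD_MASK(start));

	/* At word-aligned offsets the two sides differ: 0 vs. all ones. */
	assert(~BITMAP_FIRST_WORD_MASK(0) == 0UL);
	assert(BITMAP_LAST_WORD_MASK(0) == ~0UL);
}
```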
> map[index] |= value << offset;
> if (fit)
> return;
>
> map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> map[index + 1] |= (value >> space);
> }
> ===========================================================
>
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function
> is 32 bytes shorter than yours under x86 Clang, and 8 bytes - under
> GCC (which on the other hand does a poor job optimizing both).
>
> Overall, given that there's currently a single user of these
> functions, isn't it premature to optimize them without knowing
> anything about their performance?
>
> > In previous iteration, I asked you to share disassembly listings for the
> > functions. Can you please do that now?
>
> Will godbolt work for you (see above)?
I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.
...
> > You're mentioning that the compression ratio is 2 to 20x. Can you
> > share the absolute numbers? If it's 1k vs 2k, I think most people
> > just don't care...
>
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.
That's impressive. Fruitful idea. It would be important for embedded guys
who may disable MTE because of the memory overhead. I think it's worth
mentioning that in Kconfig, together with the associated performance
overhead, if it's ever measurable.
> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
>
> For out-of-line allocations the data can be derived from
> /proc/slabinfo, but we don't calculate inline allocations.
> Agreed, a debugfs interface won't hurt.
> space >= nbits <=>
> BITS_PER_LONG - offset >= nbits <=>
> offset + nbits <= BITS_PER_LONG
>
> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max:
> GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> point. Does it make sense?
It indeed does. Perhaps pulling offset inside GENMASK is not a bug
after all (a simple test does not show any difference between their
behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.
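The equivalence under discussion can be confirmed exhaustively outside the kernel; the sketch below re-implements GENMASK() for a 64-bit long and checks every in-word (offset, nbits) combination, for illustration only.

```c
/* Userspace re-implementation of GENMASK() for a 64-bit long. */
#include <assert.h>

#define BITS_PER_LONG 64
#define GENMASK(h, l) ((~0UL << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

/*
 * For every in-word (offset, nbits) pair, shifting a zero-based mask is
 * the same as building the mask at the offset directly; the two forms
 * only part ways once nbits + offset exceeds BITS_PER_LONG, where the
 * high index passed to GENMASK() would be invalid.
 */
static void check_genmask_shift(void)
{
	unsigned long offset, nbits;

	for (offset = 0; offset < BITS_PER_LONG; offset++)
		for (nbits = 1; nbits + offset <= BITS_PER_LONG; nbits++)
			assert((GENMASK(nbits - 1, 0) << offset) ==
			       GENMASK(nbits - 1 + offset, offset));
}
```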
>
> > ~BITMAP_FIRST_WORD_MASK(start));
>
> As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
> and vice versa.
Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().
>
> > map[index] |= value << offset;
> > if (fit)
> > return;
> >
> > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
> > map[index + 1] |= (value >> space);
> > }
I'll post the implementations together with the disassembly below.
I used some Clang 17.0.0 version that is a couple months behind
upstream, but that still produces consistently shorter code (~48 bytes
less) than the trunk GCC on Godbolt.
1. Original implementation of bitmap_write() from this patch - 164
bytes (interestingly, it's 157 bytes with Clang 14.0.6)
==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
size_t index = BIT_WORD(start);
unsigned long offset = start % BITS_PER_LONG;
unsigned long space = BITS_PER_LONG - offset;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
if (space >= nbits) {
map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
map[index] |= value << offset;
return;
}
map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
map[index] |= value << offset;
map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
1200: 41 57 push %r15
1202: 41 56 push %r14
1204: 53 push %rbx
1205: 48 85 c9 test %rcx,%rcx
1208: 74 7b je 1285 <bitmap_write+0x85>
120a: 48 89 c8 mov %rcx,%rax
120d: 49 89 d2 mov %rdx,%r10
1210: 49 c1 ea 06 shr $0x6,%r10
1214: 41 89 d1 mov %edx,%r9d
1217: f6 d9 neg %cl
1219: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1220: 49 d3 ef shr %cl,%r15
1223: 41 83 e1 3f and $0x3f,%r9d
1227: 41 b8 40 00 00 00 mov $0x40,%r8d
122d: 4c 21 fe and %r15,%rsi
1230: 48 89 f3 mov %rsi,%rbx
1233: 44 89 c9 mov %r9d,%ecx
1236: 48 d3 e3 shl %cl,%rbx
1239: 4d 29 c8 sub %r9,%r8
123c: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
1243: 4e 8b 34 d7 mov (%rdi,%r10,8),%r14
1247: 49 39 c0 cmp %rax,%r8
124a: 73 3f jae 128b <bitmap_write+0x8b>
124c: 49 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%r15
1253: 44 89 c9 mov %r9d,%ecx
1256: 49 d3 e7 shl %cl,%r15
1259: 49 f7 d7 not %r15
125c: 4d 21 fe and %r15,%r14
125f: 49 09 de or %rbx,%r14
1262: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
1266: 01 c2 add %eax,%edx
1268: f6 da neg %dl
126a: 89 d1 mov %edx,%ecx
126c: 49 d3 eb shr %cl,%r11
126f: 49 f7 d3 not %r11
1272: 44 89 c1 mov %r8d,%ecx
1275: 48 d3 ee shr %cl,%rsi
1278: 4e 23 5c d7 08 and 0x8(%rdi,%r10,8),%r11
127d: 4c 09 de or %r11,%rsi
1280: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
1285: 5b pop %rbx
1286: 41 5e pop %r14
1288: 41 5f pop %r15
128a: c3 ret
128b: 44 89 c9 mov %r9d,%ecx
128e: 49 d3 e7 shl %cl,%r15
1291: 49 f7 d7 not %r15
1294: 4d 21 fe and %r15,%r14
1297: 49 09 de or %rbx,%r14
129a: 4e 89 34 d7 mov %r14,(%rdi,%r10,8)
129e: 5b pop %rbx
129f: 41 5e pop %r14
12a1: 41 5f pop %r15
12a3: c3 ret
==================================================================
2. Your implementation, my_bitmap_write() - 152 bytes (used to be
slightly worse with Clang 14.0.6 - 159 bytes):
==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long w, end;
if (unlikely(nbits == 0))
return;
value &= GENMASK(nbits - 1, 0);
map += BIT_WORD(start);
start %= BITS_PER_LONG;
end = start + nbits - 1;
w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
BITMAP_LAST_WORD_MASK(start));
*map = w | (value << start);
if (end < BITS_PER_LONG)
return;
w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
*map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
1160: 48 85 c9 test %rcx,%rcx
1163: 0f 84 8e 00 00 00 je 11f7 <my_bitmap_write+0x97>
1169: 49 89 c9 mov %rcx,%r9
116c: f6 d9 neg %cl
116e: 48 d3 e6 shl %cl,%rsi
1171: 48 d3 ee shr %cl,%rsi
1174: 49 89 d2 mov %rdx,%r10
1177: 49 c1 ea 06 shr $0x6,%r10
117b: 89 d0 mov %edx,%eax
117d: 83 e0 3f and $0x3f,%eax
1180: 4e 8d 04 08 lea (%rax,%r9,1),%r8
1184: 4a 8d 0c 08 lea (%rax,%r9,1),%rcx
1188: 48 ff c9 dec %rcx
118b: 4e 8b 0c d7 mov (%rdi,%r10,8),%r9
118f: 48 83 f9 3f cmp $0x3f,%rcx
1193: 77 29 ja 11be <my_bitmap_write+0x5e>
1195: 41 f6 d8 neg %r8b
1198: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
119f: 44 89 c1 mov %r8d,%ecx
11a2: 48 d3 ea shr %cl,%rdx
11a5: 89 c1 mov %eax,%ecx
11a7: 48 d3 ea shr %cl,%rdx
11aa: 48 d3 e2 shl %cl,%rdx
11ad: 48 d3 e6 shl %cl,%rsi
11b0: 48 f7 d2 not %rdx
11b3: 49 21 d1 and %rdx,%r9
11b6: 4c 09 ce or %r9,%rsi
11b9: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
11bd: c3 ret
11be: f6 da neg %dl
11c0: 89 d1 mov %edx,%ecx
11c2: 49 d3 e1 shl %cl,%r9
11c5: 49 d3 e9 shr %cl,%r9
11c8: 48 89 f2 mov %rsi,%rdx
11cb: 89 c1 mov %eax,%ecx
11cd: 48 d3 e2 shl %cl,%rdx
11d0: 4c 09 ca or %r9,%rdx
11d3: 4a 89 14 d7 mov %rdx,(%rdi,%r10,8)
11d7: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
11dc: 41 f6 d8 neg %r8b
11df: 44 89 c1 mov %r8d,%ecx
11e2: 48 d3 e2 shl %cl,%rdx
11e5: 48 d3 ea shr %cl,%rdx
11e8: f6 d8 neg %al
11ea: 89 c1 mov %eax,%ecx
11ec: 48 d3 ee shr %cl,%rsi
11ef: 48 09 d6 or %rdx,%rsi
11f2: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
11f7: c3 ret
==================================================================
3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:
==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
12b0: 48 85 c9 test %rcx,%rcx
12b3: 74 6e je 1323 <bitmap_write_new+0x73>
12b5: 48 89 c8 mov %rcx,%rax
12b8: f6 d9 neg %cl
12ba: 49 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%r10
12c1: 49 d3 ea shr %cl,%r10
12c4: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
12cb: 4c 21 d6 and %r10,%rsi
12ce: 89 d1 mov %edx,%ecx
12d0: 83 e1 3f and $0x3f,%ecx
12d3: 41 b8 40 00 00 00 mov $0x40,%r8d
12d9: 49 29 c8 sub %rcx,%r8
12dc: 49 89 d1 mov %rdx,%r9
12df: 49 c1 e9 06 shr $0x6,%r9
12e3: 49 39 c0 cmp %rax,%r8
12e6: 4d 0f 42 d3 cmovb %r11,%r10
12ea: 49 d3 e2 shl %cl,%r10
12ed: 49 f7 d2 not %r10
12f0: 4e 23 14 cf and (%rdi,%r9,8),%r10
12f4: 49 89 f3 mov %rsi,%r11
12f7: 49 d3 e3 shl %cl,%r11
12fa: 4d 09 d3 or %r10,%r11
12fd: 4e 89 1c cf mov %r11,(%rdi,%r9,8)
1301: 49 39 c0 cmp %rax,%r8
1304: 73 1d jae 1323 <bitmap_write_new+0x73>
1306: 01 d0 add %edx,%eax
1308: 4a 8b 54 cf 08 mov 0x8(%rdi,%r9,8),%rdx
130d: 89 c1 mov %eax,%ecx
130f: 48 d3 ea shr %cl,%rdx
1312: 48 d3 e2 shl %cl,%rdx
1315: 44 89 c1 mov %r8d,%ecx
1318: 48 d3 ee shr %cl,%rsi
131b: 48 09 d6 or %rdx,%rsi
131e: 4a 89 74 cf 08 mov %rsi,0x8(%rdi,%r9,8)
1323: c3 ret
==================================================================
4. The version of bitmap_write_new() that uses `(GENMASK(nbits - 1 +
offset, offset)` - 146 bytes:
==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
unsigned long start, unsigned long nbits)
{
unsigned long offset;
unsigned long space;
size_t index;
bool fit;
if (unlikely(!nbits))
return;
value &= GENMASK(nbits - 1, 0);
offset = start % BITS_PER_LONG;
space = BITS_PER_LONG - offset;
index = BIT_WORD(start);
fit = space >= nbits;
map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
~BITMAP_FIRST_WORD_MASK(start));
map[index] |= value << offset;
if (fit)
return;
map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
1330: 48 85 c9 test %rcx,%rcx
1333: 74 6a je 139f <bitmap_write_new_shift+0x6f>
1335: 48 89 c8 mov %rcx,%rax
1338: f6 d9 neg %cl
133a: 48 d3 e6 shl %cl,%rsi
133d: 48 d3 ee shr %cl,%rsi
1340: 41 89 d0 mov %edx,%r8d
1343: 41 83 e0 3f and $0x3f,%r8d
1347: 41 b9 40 00 00 00 mov $0x40,%r9d
134d: 4d 29 c1 sub %r8,%r9
1350: 49 89 d2 mov %rdx,%r10
1353: 49 c1 ea 06 shr $0x6,%r10
1357: 49 c7 c3 ff ff ff ff mov $0xffffffffffffffff,%r11
135e: 44 89 c1 mov %r8d,%ecx
1361: 49 d3 e3 shl %cl,%r11
1364: 49 39 c1 cmp %rax,%r9
1367: 73 37 jae 13a0 <bitmap_write_new_shift+0x70>
1369: 53 push %rbx
136a: 49 f7 d3 not %r11
136d: 4e 23 1c d7 and (%rdi,%r10,8),%r11
1371: 48 89 f3 mov %rsi,%rbx
1374: 44 89 c1 mov %r8d,%ecx
1377: 48 d3 e3 shl %cl,%rbx
137a: 4c 09 db or %r11,%rbx
137d: 4a 89 1c d7 mov %rbx,(%rdi,%r10,8)
1381: 01 d0 add %edx,%eax
1383: 4a 8b 54 d7 08 mov 0x8(%rdi,%r10,8),%rdx
1388: 89 c1 mov %eax,%ecx
138a: 48 d3 ea shr %cl,%rdx
138d: 48 d3 e2 shl %cl,%rdx
1390: 44 89 c9 mov %r9d,%ecx
1393: 48 d3 ee shr %cl,%rsi
1396: 48 09 d6 or %rdx,%rsi
1399: 4a 89 74 d7 08 mov %rsi,0x8(%rdi,%r10,8)
139e: 5b pop %rbx
139f: c3 ret
13a0: 44 01 c0 add %r8d,%eax
13a3: f6 d8 neg %al
13a5: 89 c1 mov %eax,%ecx
13a7: 49 d3 e3 shl %cl,%r11
13aa: 49 d3 eb shr %cl,%r11
13ad: 49 f7 d3 not %r11
13b0: 44 89 c1 mov %r8d,%ecx
13b3: 48 d3 e6 shl %cl,%rsi
13b6: 4e 23 1c d7 and (%rdi,%r10,8),%r11
13ba: 4c 09 de or %r11,%rsi
13bd: 4a 89 34 d7 mov %rsi,(%rdi,%r10,8)
13c1: c3 ret
==================================================================
For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.
PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this letter until then.
On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > space >= nbits <=>
> > BITS_PER_LONG - offset >= nbits <=>
> > offset + nbits <= BITS_PER_LONG
> >
> > > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max:
> > GENMASK(BITS_PER_LONG - 1, offset). And it never overflows, which is my
> > point. Does it make sense?
>
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug
> after all (a simple test does not show any difference between their
> behavior.
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.
OK. Can you put a comment explaining this? Or maybe would be even
better to use BITMAP_LAST_WORD_MASK() here:
mask = BITMAP_LAST_WORD_MASK(nbits);
value &= mask;
...
map[index] &= (fit ? (~mask << offset)) :
> > > ~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK(),
> > and vice versa.
>
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().
Wow... If that's consistent across different compilers/arches, we'd
just drop the latter. Thanks for pointing that. I'll check.
> > > map[index] |= value << offset;
> > > if (fit)
> > > return;
> > >
> > > map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with a BITMAP_FIRST_WORD_MASK here.
>
> > > map[index + 1] |= (value >> space);
> > > }
>
> I'll post the implementations together with the disassembly below.
> I used some Clang 17.0.0 version that is a couple months behind
> upstream, but that still produces consistently shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
>
> 1. Original implementation of bitmap_write() from this patch - 164
> bytes (interestingly, it's 157 bytes with Clang 14.0.6)
I spotted that too in some other case. Newer compilers tend to
generate bigger code, but the result usually works faster. One
particular reason in my case was loop unrolling.
[...]
> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:
30% better in size - that's impressive!
> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> 		      unsigned long start, unsigned long nbits)
> {
> 	unsigned long offset;
> 	unsigned long space;
> 	size_t index;
> 	bool fit;
>
> 	if (unlikely(!nbits))
> 		return;
>
> 	value &= GENMASK(nbits - 1, 0);
> 	offset = start % BITS_PER_LONG;
> 	space = BITS_PER_LONG - offset;
> 	index = BIT_WORD(start);
> 	fit = space >= nbits;
>
> 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> 		       ~BITMAP_FIRST_WORD_MASK(start));
> 	map[index] |= value << offset;
> 	if (fit)
> 		return;
>
> 	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
> 	map[index + 1] |= (value >> space);
> }
Thanks,
Yury
> OK. Can you put a comment explaining this? Or maybe would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
> 	mask = BITMAP_LAST_WORD_MASK(nbits);
> 	value &= mask;
> 	...
> 	map[index] &= (fit ? (~mask << offset)) :

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?
On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > > > ~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as BITMAP_LAST_WORD_MASK()
> > > and vice versa.
> >
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>
> Wow... If that's consistent across different compilers/arches, we'd
> just drop the latter. Thanks for pointing that. I'll check.

It might be...

...

> > map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > 		~BITMAP_FIRST_WORD_MASK(start));

...that both parts of the ternary use a negation, so the compiler may
apply the negation only once. (Haven't looked at the assembly.)

--
With Best Regards,
Andy Shevchenko
On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > +	map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

But you followed the thread to get a clue why it's written in this form, right?

...

> With all that I think the implementation should look something like
> this:

I would go this way if and only if the code generation on main architectures
with both GCC and clang is better.

And maybe even some performance tests need to be provided.

...

> Alexander, can you please try the above and compare?
> In previous iteration, I asked you to share disassembly listings for the
> functions. Can you please do that now?

Exactly what we need before going with your suggestion(s).

--
With Best Regards,
Andy Shevchenko
On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
>
> But you followed the thread to get a clue why it's written in this form, right?
Yes, I did. But I don't expect everyone looking at kernel code would spend
time recovering discussions that explain why that happened. So, at least it
would be fine to drop a comment.
> ...
>
> > With all that I think the implementation should look something like
> > this:
>
> I would go this way if and only if the code generation on main architectures
> with both GCC and clang is better.
>
> And maybe even some performance tests need to be provided.
For the following implementation:
void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}
This is the bloat-o-meter output:
$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function old new delta
test_bitmap_init 3846 5484 +1638
test_bitmap_write_perf - 401 +401
bitmap_write - 271 +271
my_bitmap_write - 248 +248
bitmap_read - 229 +229
__pfx_test_bitmap_write_perf - 16 +16
__pfx_my_bitmap_write - 16 +16
__pfx_bitmap_write - 16 +16
__pfx_bitmap_read - 16 +16
Total: Before=36964, After=39815, chg +7.71%
And this is the performance test:
	for (cnt = 0; cnt < 5; cnt++) {
		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_err("bitmap_write:\t%llu\t", time);

		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				my_bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_cont("%llu\n", time);
	}
Which on x86_64/kvm with GCC gives:
Orig My
[ 1.630731] test_bitmap: bitmap_write: 299092 252764
[ 1.631584] test_bitmap: bitmap_write: 299522 252554
[ 1.632429] test_bitmap: bitmap_write: 299171 258665
[ 1.633280] test_bitmap: bitmap_write: 299241 252794
[ 1.634133] test_bitmap: bitmap_write: 306716 252934
So, it's a ~15% difference in performance and 8% in size.

I don't insist on my implementation, but I think we should experiment more
with code generation.
Thanks,
Yury
On Mon, Jul 24, 2023 at 10:04:34PM -0700, Yury Norov wrote:
> On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> > On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

...

> > > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> >
> > But you followed the thread to get a clue why it's written in this form, right?
>
> Yes, I did. But I don't expect everyone looking at kernel code would spend
> time recovering discussions that explain why that happened. So, at least it
> would be fine to drop a comment.

See also below.

...

> 	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));

This GENMASK() may generate worse code, as the compiler uses more
instructions than the simpler approach above.

...

> bitmap_write - 271 +271
> my_bitmap_write - 248 +248
> bitmap_read - 229 +229

my_ -- means your proposal? Do you mean you have it better in size than the
original one?

--
With Best Regards,
Andy Shevchenko
On Sat, Jul 22, 2023 at 06:57:26PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > + unsigned long value,
> > + unsigned long start, unsigned long nbits)
> > +{
> > + size_t index = BIT_WORD(start);
> > + unsigned long offset = start % BITS_PER_LONG;
> > + unsigned long space = BITS_PER_LONG - offset;
> > +
> > + if (unlikely(!nbits))
> > + return;
> > + value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits
> because otherwise it's an out-of-bounds type of error.
>
> This is kind of a gray zone to me, because it's a caller's responsibility
> to provide correct input. But on the other hand, it would be a great
> headache debugging corrupted bitmaps.
>
> Now that we've got a single user of the bitmap_write, and even more,
> it's wrapped with a helper, I think it would be reasonable to trim a
> 'value' in the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
OK. I spent a night with that, and I'm still not sure. Pseudo code
that implements it looks like this:
for (bit = 0; bit < nbits; bit++)
assign_bit(start + bit, bitmap, val & BIT(bit));
And it ignores trailing bits. So if we're optimizing this pattern,
we'd ignore these bits just as well...
Either way, whatever we decide, let's be clear about our intentions
and mention explicitly that tail bits either must be zero, or are
ignored.
Alexander, can you add the snippet above to the comments for the
bitmap_write() and bitmap_read(), as well as in the test? Also, if we
decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
generate better code than GENMASK(nbits - 1, 0) does?
Comments are very appreciated.
Thanks,
Yury
> Either way, whatever we decide, let's stay clear with our intentions
> and mention explicitly that tail bits are either must be zero, or
> ignored.
>
> Alexander, can you add the snippet above to the comments for the
> bitmap_write() and bitmap_read(), as well as in the test? Also, if we
> decide to clear tail of the input value, would BITMAP_LAST_WORD_MASK()
> generate better code than GENMASK(nbits - 1, 0) does?

Added the snippet above to bitmap_write(). I think, however, it is obvious
that bitmap_read() reads only nbits?

> Commets are very appreciated.
>
> Thanks,
> Yury

--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg