The cpumask_any_but() function can avoid using a loop to determine the
CPU index to return. If the first set bit in the cpumask is not equal
to the specified CPU, we can directly return the index of the first set
bit. Otherwise, we return the next set bit's index.

This optimization replaces the loop with a single if statement,
allowing the compiler to generate more concise and efficient code.

As a result, the size of the bzImage built with x86 defconfig is
reduced by 4096 bytes:

* Before:
$ size arch/x86/boot/bzImage
    text    data     bss      dec     hex filename
13537280    1024       0 13538304  ce9400 arch/x86/boot/bzImage

* After:
$ size arch/x86/boot/bzImage
    text    data     bss      dec     hex filename
13533184    1024       0 13534208  ce8400 arch/x86/boot/bzImage
Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
Not sure how to measure the efficiency difference, but I guess this
patch might be slightly more efficient or nearly the same as before. If
you have any good ideas for measuring efficiency, please let me know!
include/linux/cpumask.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9278a50d514f..b769fcdbaa10 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 	unsigned int i;

 	cpumask_check(cpu);
-	for_each_cpu(i, mask)
-		if (i != cpu)
-			break;
-	return i;
+	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
+	if (i != cpu)
+		return i;
+	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
 }

 /**
--
2.34.1
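
The claim that the single if statement returns the same index as the loop can be checked outside the kernel. The following is a minimal userspace sketch, not kernel code: it models the cpumask as one 64-bit word, and first_bit(), next_bit(), any_but_loop(), any_but_branch() and count_mismatches() are hypothetical stand-ins for find_first_bit(), find_next_bit() and the two versions of cpumask_any_but(). It assumes a GCC or Clang compiler for __builtin_ctzll().

```c
#include <assert.h>
#include <stdint.h>

/* Lowest set bit below nbits, or nbits if there is none (nbits <= 64). */
static unsigned int first_bit(uint64_t word, unsigned int nbits)
{
	unsigned int i;

	assert(nbits <= 64);
	if (word == 0)
		return nbits;
	i = (unsigned int)__builtin_ctzll(word);	/* GCC/Clang builtin */
	return i < nbits ? i : nbits;
}

/* Lowest set bit at position >= start, or nbits if there is none. */
static unsigned int next_bit(uint64_t word, unsigned int nbits, unsigned int start)
{
	if (start >= nbits || start >= 64)
		return nbits;
	return first_bit(word & (~UINT64_C(0) << start), nbits);
}

/* Loop form, modelled on the for_each_cpu() version the patch removes. */
static unsigned int any_but_loop(uint64_t mask, unsigned int nbits, unsigned int cpu)
{
	unsigned int i;

	for (i = first_bit(mask, nbits); i < nbits; i = next_bit(mask, nbits, i + 1))
		if (i != cpu)
			break;
	return i;
}

/* Branch form, modelled on the version the patch adds. */
static unsigned int any_but_branch(uint64_t mask, unsigned int nbits, unsigned int cpu)
{
	unsigned int i = first_bit(mask, nbits);

	if (i != cpu)
		return i;
	return next_bit(mask, nbits, i + 1);
}

/* Compare both forms for every mask and every cpu below nbits. */
static unsigned int count_mismatches(unsigned int nbits)
{
	unsigned int bad = 0;

	assert(nbits <= 16);	/* keeps the exhaustive sweep small */
	for (uint64_t mask = 0; mask < (UINT64_C(1) << nbits); mask++)
		for (unsigned int cpu = 0; cpu < nbits; cpu++)
			if (any_but_loop(mask, nbits, cpu) != any_but_branch(mask, nbits, cpu))
				bad++;
	return bad;
}
```

Calling count_mismatches() with a small width such as 10 sweeps every mask of that width. The two forms agree because the loop can skip at most one bit: only the first set bit can be equal to cpu, so the second set bit, if any, is always acceptable.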
On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> The cpumask_any_but() function can avoid using a loop to determine the
> CPU index to return. If the first set bit in the cpumask is not equal
> to the specified CPU, we can directly return the index of the first set
> bit. Otherwise, we return the next set bit's index.
>
> This optimization replaces the loop with a single if statement,
> allowing the compiler to generate more concise and efficient code.
>
> As a result, the size of the bzImage built with x86 defconfig is
> reduced by 4096 bytes:
>
> * Before:
> $ size arch/x86/boot/bzImage
> text data bss dec hex filename
> 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
>
> * After:
> $ size arch/x86/boot/bzImage
> text data bss dec hex filename
> 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
>
> Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
> Not sure how to measure the efficiency difference, but I guess this
> patch might be slightly more efficient or nearly the same as before. If
> you have any good ideas for measuring efficiency, please let me know!
>
> include/linux/cpumask.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 9278a50d514f..b769fcdbaa10 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> unsigned int i;
>
> cpumask_check(cpu);
> - for_each_cpu(i, mask)
> - if (i != cpu)
> - break;
> - return i;
> + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);

Hi Kuan-Wei,

How about using cpumask_first(mask) here to keep better consistency?

> + if (i != cpu)
> + return i;

Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
perform find_next_bit().

> + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> }
>

Regards,
I Hsin

> /**
> --
> 2.34.1
>
On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.
> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!
> >
> > include/linux/cpumask.h | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > unsigned int i;
> >
> > cpumask_check(cpu);
> > - for_each_cpu(i, mask)
> > - if (i != cpu)
> > - break;
> > - return i;
> > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?
>
Sure.

> > + if (i != cpu)
> > + return i;
> Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
Hmm, adding this check increases the image size back to what it was
before the patch. Also, it only saves the execution time of
find_next_bit() when the cpumask is entirely zero. I'm not sure how
often a fully zero cpumask actually occurs in practice, but if we
assume a uniform distribution, the probability of this case would
approach zero.

Regards,
Kuan-Wei

> > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > }
> >
>
> Regards,
> I Hsin
>
> > /**
> > --
> > 2.34.1
> >
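
One way to probe how often the proposed "i >= nr_cpu_ids" guard could fire is a small userspace model. The sketch below is not kernel code and rests on stated assumptions: the mask is a single 64-bit word, one width nbits stands in for both nr_cpu_ids and small_cpumask_bits, and first_bit(), struct lookup, any_but_guarded() and count_guard_hits() are hypothetical names introduced only for illustration. It places the guard between the "i != cpu" test and the second search, then counts how many (mask, cpu) pairs reach it.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest set bit below nbits, or nbits if there is none (nbits <= 64). */
static unsigned int first_bit(uint64_t word, unsigned int nbits)
{
	unsigned int i;

	assert(nbits <= 64);
	if (word == 0)
		return nbits;
	i = (unsigned int)__builtin_ctzll(word);	/* GCC/Clang builtin */
	return i < nbits ? i : nbits;
}

/* Returned index plus a flag recording whether the guard short-circuited. */
struct lookup {
	unsigned int index;
	int guard_hit;
};

/* Patched logic with the proposed guard before the second search. */
static struct lookup any_but_guarded(uint64_t mask, unsigned int nbits, unsigned int cpu)
{
	struct lookup r = { first_bit(mask, nbits), 0 };
	uint64_t rest;

	if (r.index != cpu)
		return r;
	if (r.index >= nbits) {		/* the proposed guard */
		r.guard_hit = 1;
		return r;		/* r.index already equals nbits */
	}
	/* Second search: keep only the bits above r.index. */
	rest = (r.index + 1 < 64) ? (mask & (~UINT64_C(0) << (r.index + 1))) : 0;
	r.index = first_bit(rest, nbits);
	return r;
}

/* Count guard hits over every mask of width nbits and every cpu in [0, cpu_limit). */
static unsigned int count_guard_hits(unsigned int nbits, unsigned int cpu_limit)
{
	unsigned int hits = 0;

	assert(nbits <= 16);	/* keeps the exhaustive sweep small */
	for (uint64_t mask = 0; mask < (UINT64_C(1) << nbits); mask++)
		for (unsigned int cpu = 0; cpu < cpu_limit; cpu++)
			hits += (unsigned int)any_but_guarded(mask, nbits, cpu).guard_hit;
	return hits;
}
```

Under these assumptions, an empty mask makes first_bit() return nbits, which differs from any cpu below nbits, so the function returns through the first branch and never reaches the guard. In this model the guard fires only when cpu itself is out of range and equal to nbits. Whether that carries over to the kernel depends on how nr_cpu_ids and small_cpumask_bits relate in a given configuration, which this sketch does not model.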
On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> >
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.

I thought compilers are smart enough to unroll loop in this case. Can
you show disassembled code before and after?

> >
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> >
> > * Before:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> >
> > * After:
> > $ size arch/x86/boot/bzImage
> > text data bss dec hex filename
> > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage

Comparing zipped images tells little about code generation. Please use
scripts/bloat-o-meter.

> >
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!

Check lib/find_bit_benchmark.c

> >
> > include/linux/cpumask.h | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > unsigned int i;
> >
> > cpumask_check(cpu);
> > - for_each_cpu(i, mask)
> > - if (i != cpu)
> > - break;
> > - return i;
> > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
>
> Hi Kuan-Wei,
>
> How about using cpumask_first(mask) here to keep better consistency?

I would do it the other way: introduce find_first_but_bit(), and then
make cpumask_any_but() a wrapper around it. Doing this you'll be able
to leverage find_bit_benchmark infrastructure to measure performance
difference, if any.

> > + if (i != cpu)
> > + return i;
> Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ?

Yes it would.

Thanks,
Yury

> if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
>
> > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > }
> >
>
> Regards,
> I Hsin
>
> > /**
> > --
> > 2.34.1
> >
Hi Yury,
On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > >
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
>
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
>
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:

unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
	return cpumask_any_but(mask, cpu);
}

I used objdump -d ./lib/cpumask.o to compare the differences.

* Before the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 3f mov (%rdi),%rdi
1f7: b8 40 00 00 00 mov $0x40,%eax
1fc: 48 85 ff test %rdi,%rdi
1ff: 74 4b je 24c <non_inline_cpumask_any_but+0x5c>
201: f3 48 0f bc d7 tzcnt %rdi,%rdx
206: 89 d0 mov %edx,%eax
208: 39 d6 cmp %edx,%esi
20a: 75 40 jne 24c <non_inline_cpumask_any_but+0x5c>
20c: 83 fa 3f cmp $0x3f,%edx
20f: 77 3b ja 24c <non_inline_cpumask_any_but+0x5c>
211: 41 b8 01 00 00 00 mov $0x1,%r8d
217: 83 c0 01 add $0x1,%eax
21a: 83 f8 40 cmp $0x40,%eax
21d: 74 2d je 24c <non_inline_cpumask_any_but+0x5c>
21f: 89 c1 mov %eax,%ecx
221: 4c 89 c2 mov %r8,%rdx
224: 48 d3 e2 shl %cl,%rdx
227: 48 89 d0 mov %rdx,%rax
22a: 48 f7 d8 neg %rax
22d: 48 21 f8 and %rdi,%rax
230: 74 15 je 247 <non_inline_cpumask_any_but+0x57>
232: f3 48 0f bc d0 tzcnt %rax,%rdx
237: 89 d0 mov %edx,%eax
239: 39 d6 cmp %edx,%esi
23b: 75 0f jne 24c <non_inline_cpumask_any_but+0x5c>
23d: 83 fa 3f cmp $0x3f,%edx
240: 76 d5 jbe 217 <non_inline_cpumask_any_but+0x27>
242: e9 00 00 00 00 jmp 247 <non_inline_cpumask_any_but+0x57>
247: b8 40 00 00 00 mov $0x40,%eax
24c: e9 00 00 00 00 jmp 251 <non_inline_cpumask_any_but+0x61>
* After the patch:
00000000000001f0 <non_inline_cpumask_any_but>:
1f0: f3 0f 1e fa endbr64
1f4: 48 8b 17 mov (%rdi),%rdx
1f7: 48 85 d2 test %rdx,%rdx
1fa: 74 34 je 230 <non_inline_cpumask_any_but+0x40>
1fc: f3 48 0f bc ca tzcnt %rdx,%rcx
201: 89 c8 mov %ecx,%eax
203: 39 ce cmp %ecx,%esi
205: 75 2e jne 235 <non_inline_cpumask_any_but+0x45>
207: 83 c1 01 add $0x1,%ecx
20a: 83 f9 3f cmp $0x3f,%ecx
20d: 77 21 ja 230 <non_inline_cpumask_any_but+0x40>
20f: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
216: 48 d3 e0 shl %cl,%rax
219: 48 89 c1 mov %rax,%rcx
21c: b8 40 00 00 00 mov $0x40,%eax
221: 48 21 d1 and %rdx,%rcx
224: 74 0f je 235 <non_inline_cpumask_any_but+0x45>
226: f3 48 0f bc c1 tzcnt %rcx,%rax
22b: e9 00 00 00 00 jmp 230 <non_inline_cpumask_any_but+0x40>
230: b8 40 00 00 00 mov $0x40,%eax
235: e9 00 00 00 00 jmp 23a <non_inline_cpumask_any_but+0x4a>
> > >
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > >
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> > >
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > > text data bss dec hex filename
> > > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
>
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
>
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function                                     old     new   delta
non_inline_cpumask_any_but                    97      74     -23
Total: Before=522, After=499, chg -4.41%
> > >
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
>
> Check lib/find_bit_benchmark.c
>
> > >
> > > include/linux/cpumask.h | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > > unsigned int i;
> > >
> > > cpumask_check(cpu);
> > > - for_each_cpu(i, mask)
> > > - if (i != cpu)
> > > - break;
> > > - return i;
> > > + i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> >
> > Hi Kuan-Wei,
> >
> > How about using cpumask_first(mask) here to keep better consistency?
>
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>
I'll try to conduct this experiment.
Regards,
Kuan-Wei
> > > + if (i != cpu)
> > > + return i;
> > > Wouldn't it benefit a bit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
>
> Yes it would.
>
> Thanks,
> Yury
>
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> >
> > > + return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > > }
> > >
> >
> > Regards,
> > I Hsin
> >
> > > /**
> > > --
> > > 2.34.1
> > >
On Sat, Jan 18, 2025 at 03:32:29PM +0800, Kuan-Wei Chiu wrote:
> Hi Yury,
>
> On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> > On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > > The cpumask_any_but() function can avoid using a loop to determine the
> > > > CPU index to return. If the first set bit in the cpumask is not equal
> > > > to the specified CPU, we can directly return the index of the first set
> > > > bit. Otherwise, we return the next set bit's index.
> > > >
> > > > This optimization replaces the loop with a single if statement,
> > > > allowing the compiler to generate more concise and efficient code.
> >
> > I thought compilers are smart enough to unroll loop in this case. Can
> > you show disassembled code before and after?
> >
> Since cpumask_any_but() is an inline function, I added the following to
> lib/cpumask.c for convenience:
>
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> {
> return cpumask_any_but(mask, cpu);
> }
>
> I used objdump -d ./lib/cpumask.o to compare the differences.
>
> * Before the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
> 1f0: f3 0f 1e fa endbr64
> 1f4: 48 8b 3f mov (%rdi),%rdi
> 1f7: b8 40 00 00 00 mov $0x40,%eax
> 1fc: 48 85 ff test %rdi,%rdi
> 1ff: 74 4b je 24c <non_inline_cpumask_any_but+0x5c>
> 201: f3 48 0f bc d7 tzcnt %rdi,%rdx
> 206: 89 d0 mov %edx,%eax
> 208: 39 d6 cmp %edx,%esi
> 20a: 75 40 jne 24c <non_inline_cpumask_any_but+0x5c>
> 20c: 83 fa 3f cmp $0x3f,%edx
> 20f: 77 3b ja 24c <non_inline_cpumask_any_but+0x5c>
> 211: 41 b8 01 00 00 00 mov $0x1,%r8d
> 217: 83 c0 01 add $0x1,%eax
> 21a: 83 f8 40 cmp $0x40,%eax
> 21d: 74 2d je 24c <non_inline_cpumask_any_but+0x5c>
> 21f: 89 c1 mov %eax,%ecx
> 221: 4c 89 c2 mov %r8,%rdx
> 224: 48 d3 e2 shl %cl,%rdx
> 227: 48 89 d0 mov %rdx,%rax
> 22a: 48 f7 d8 neg %rax
> 22d: 48 21 f8 and %rdi,%rax
> 230: 74 15 je 247 <non_inline_cpumask_any_but+0x57>
> 232: f3 48 0f bc d0 tzcnt %rax,%rdx
> 237: 89 d0 mov %edx,%eax
> 239: 39 d6 cmp %edx,%esi
> 23b: 75 0f jne 24c <non_inline_cpumask_any_but+0x5c>
> 23d: 83 fa 3f cmp $0x3f,%edx
> 240: 76 d5 jbe 217 <non_inline_cpumask_any_but+0x27>
> 242: e9 00 00 00 00 jmp 247 <non_inline_cpumask_any_but+0x57>
> 247: b8 40 00 00 00 mov $0x40,%eax
> 24c: e9 00 00 00 00 jmp 251 <non_inline_cpumask_any_but+0x61>
>
> * After the patch:
>
> 00000000000001f0 <non_inline_cpumask_any_but>:
> 1f0: f3 0f 1e fa endbr64
> 1f4: 48 8b 17 mov (%rdi),%rdx
> 1f7: 48 85 d2 test %rdx,%rdx
> 1fa: 74 34 je 230 <non_inline_cpumask_any_but+0x40>
> 1fc: f3 48 0f bc ca tzcnt %rdx,%rcx
> 201: 89 c8 mov %ecx,%eax
> 203: 39 ce cmp %ecx,%esi
> 205: 75 2e jne 235 <non_inline_cpumask_any_but+0x45>
> 207: 83 c1 01 add $0x1,%ecx
> 20a: 83 f9 3f cmp $0x3f,%ecx
> 20d: 77 21 ja 230 <non_inline_cpumask_any_but+0x40>
> 20f: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
> 216: 48 d3 e0 shl %cl,%rax
> 219: 48 89 c1 mov %rax,%rcx
> 21c: b8 40 00 00 00 mov $0x40,%eax
> 221: 48 21 d1 and %rdx,%rcx
> 224: 74 0f je 235 <non_inline_cpumask_any_but+0x45>
> 226: f3 48 0f bc c1 tzcnt %rcx,%rax
> 22b: e9 00 00 00 00 jmp 230 <non_inline_cpumask_any_but+0x40>
> 230: b8 40 00 00 00 mov $0x40,%eax
> 235: e9 00 00 00 00 jmp 23a <non_inline_cpumask_any_but+0x4a>
>
> > > >
> > > > As a result, the size of the bzImage built with x86 defconfig is
> > > > reduced by 4096 bytes:
> > > >
> > > > * Before:
> > > > $ size arch/x86/boot/bzImage
> > > > text data bss dec hex filename
> > > > 13537280 1024 0 13538304 ce9400 arch/x86/boot/bzImage
> > > >
> > > > * After:
> > > > $ size arch/x86/boot/bzImage
> > > > text data bss dec hex filename
> > > > 13533184 1024 0 13534208 ce8400 arch/x86/boot/bzImage
> >
> > Comparing zipped images tells little about code generation. Please use
> > scripts/bloat-o-meter.
> >
> $ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
> Function old new delta
> non_inline_cpumask_any_but 97 74 -23
> Total: Before=522, After=499, chg -4.41%

No need to introduce a wrapper. You need to build allyesconfig (or
defconfig) before and after your patch, and then run bloat-o-meter
against the old and new vmlinux.

And specifically for cpumasks, can you please run this experiment
with NR_CPUS == 32 and NR_CPUS == 4096, for example. That way you
will test the change against the small_cpumask_bits optimization.

Thanks,
Yury