cpumask: Optimize cpumask_any_but()

[PATCH] cpumask: Optimize cpumask_any_but()

Posted by Kuan-Wei Chiu 1 year ago

The cpumask_any_but() function can avoid using a loop to determine the
CPU index to return. If the first set bit in the cpumask is not equal
to the specified CPU, we can directly return the index of the first set
bit. Otherwise, we return the next set bit's index.

This optimization replaces the loop with a single if statement,
allowing the compiler to generate more concise and efficient code.

As a result, the size of the bzImage built with x86 defconfig is
reduced by 4096 bytes:

* Before:
$ size arch/x86/boot/bzImage
   text    data     bss     dec     hex filename
13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage

* After:
$ size arch/x86/boot/bzImage
   text    data     bss     dec     hex filename
13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage

Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
Not sure how to measure the efficiency difference, but I guess this
patch might be slightly more efficient or nearly the same as before. If
you have any good ideas for measuring efficiency, please let me know!

 include/linux/cpumask.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 9278a50d514f..b769fcdbaa10 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
 	unsigned int i;
 
 	cpumask_check(cpu);
-	for_each_cpu(i, mask)
-		if (i != cpu)
-			break;
-	return i;
+	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
+	if (i != cpu)
+		return i;
+	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
 }
 
 /**
-- 
2.34.1

Re: [PATCH] cpumask: Optimize cpumask_any_but()

Posted by I Hsin Cheng 1 year ago

On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> The cpumask_any_but() function can avoid using a loop to determine the
> CPU index to return. If the first set bit in the cpumask is not equal
> to the specified CPU, we can directly return the index of the first set
> bit. Otherwise, we return the next set bit's index.
> 
> This optimization replaces the loop with a single if statement,
> allowing the compiler to generate more concise and efficient code.
> 
> As a result, the size of the bzImage built with x86 defconfig is
> reduced by 4096 bytes:
> 
> * Before:
> $ size arch/x86/boot/bzImage
>    text    data     bss     dec     hex filename
> 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> 
> * After:
> $ size arch/x86/boot/bzImage
>    text    data     bss     dec     hex filename
> 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage
> 
> Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
> Not sure how to measure the efficiency difference, but I guess this
> patch might be slightly more efficient or nearly the same as before. If
> you have any good ideas for measuring efficiency, please let me know!
> 
>  include/linux/cpumask.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 9278a50d514f..b769fcdbaa10 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
>  	unsigned int i;
>  
>  	cpumask_check(cpu);
> -	for_each_cpu(i, mask)
> -		if (i != cpu)
> -			break;
> -	return i;
> +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);

Hi Kuan-Wei,

How about using cpumask_first(mask) here to keep better consistency?

> +	if (i != cpu)
> +		return i;
Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
perform find_next_bit().

> +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
>  }
>  

Regards,
I Hsin

>  /**
> -- 
> 2.34.1
>

Re: [PATCH] cpumask: Optimize cpumask_any_but()

Posted by Kuan-Wei Chiu 1 year ago

On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> > 
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.
> > 
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> > 
> > * Before:
> > $ size arch/x86/boot/bzImage
> >    text    data     bss     dec     hex filename
> > 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> > 
> > * After:
> > $ size arch/x86/boot/bzImage
> >    text    data     bss     dec     hex filename
> > 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage
> > 
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!
> > 
> >  include/linux/cpumask.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> >  	unsigned int i;
> >  
> >  	cpumask_check(cpu);
> > -	for_each_cpu(i, mask)
> > -		if (i != cpu)
> > -			break;
> > -	return i;
> > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> 
> Hi Kuan-Wei,
> 
> How about using cpumask_first(mask) here to keep better consistency?
>
Sure.

> > +	if (i != cpu)
> > +		return i;
> Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ? if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
> 
Hmm, adding this check increases the image size back to what it was
before the patch. Also, it only saves the execution time of
find_next_bit() when the cpumask is entirely zero. I'm not sure how
often a fully zero cpumask actually occurs in practice, but if we
assume a uniform distribution, the probability of this case would
approach zero.

Regards,
Kuan-Wei
> > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> >  }
> >  
> 
> Regards,
> I Hsin
> 
> >  /**
> > -- 
> > 2.34.1
> >

Re: [PATCH] cpumask: Optimize cpumask_any_but()

Posted by Yury Norov 1 year ago

On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > The cpumask_any_but() function can avoid using a loop to determine the
> > CPU index to return. If the first set bit in the cpumask is not equal
> > to the specified CPU, we can directly return the index of the first set
> > bit. Otherwise, we return the next set bit's index.
> > 
> > This optimization replaces the loop with a single if statement,
> > allowing the compiler to generate more concise and efficient code.

I thought compilers are smart enough to unroll loop in this case. Can
you show disassembled code before and after?

> > 
> > As a result, the size of the bzImage built with x86 defconfig is
> > reduced by 4096 bytes:
> > 
> > * Before:
> > $ size arch/x86/boot/bzImage
> >    text    data     bss     dec     hex filename
> > 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> > 
> > * After:
> > $ size arch/x86/boot/bzImage
> >    text    data     bss     dec     hex filename
> > 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage

Comparing zipped images tells little about code generation. Please use
scripts/bloat-o-meter.

> > 
> > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
> > Not sure how to measure the efficiency difference, but I guess this
> > patch might be slightly more efficient or nearly the same as before. If
> > you have any good ideas for measuring efficiency, please let me know!

Check lib/find_bit_benchmark.c

> > 
> >  include/linux/cpumask.h | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > index 9278a50d514f..b769fcdbaa10 100644
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> >  	unsigned int i;
> >  
> >  	cpumask_check(cpu);
> > -	for_each_cpu(i, mask)
> > -		if (i != cpu)
> > -			break;
> > -	return i;
> > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> 
> Hi Kuan-Wei,
> 
> How about using cpumask_first(mask) here to keep better consistency?

I would do it the other way: introduce find_first_but_bit(), and then
make cpumask_any_but() a wrapper around it. Doing this you'll be able
to leverage find_bit_benchmark infrastructure to measure performance
difference, if any.
 
> > +	if (i != cpu)
> > +		return i;
> Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> find_next_bit() ?

Yes it would.

Thanks,
Yury

> if "i >= nr_cpu_ids" holds it would be a waste to
> perform find_next_bit().
> 
> > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> >  }
> >  
> 
> Regards,
> I Hsin
> 
> >  /**
> > -- 
> > 2.34.1
> >

Re: [PATCH] cpumask: Optimize cpumask_any_but()

Posted by Kuan-Wei Chiu 1 year ago

Hi Yury,

On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > The cpumask_any_but() function can avoid using a loop to determine the
> > > CPU index to return. If the first set bit in the cpumask is not equal
> > > to the specified CPU, we can directly return the index of the first set
> > > bit. Otherwise, we return the next set bit's index.
> > > 
> > > This optimization replaces the loop with a single if statement,
> > > allowing the compiler to generate more concise and efficient code.
> 
> I thought compilers are smart enough to unroll loop in this case. Can
> you show disassembled code before and after?
> 
Since cpumask_any_but() is an inline function, I added the following to
lib/cpumask.c for convenience:

unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
{
	return cpumask_any_but(mask, cpu);
}

I used objdump -d ./lib/cpumask.o to compare the differences.

* Before the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:	f3 0f 1e fa          	endbr64 
 1f4:	48 8b 3f             	mov    (%rdi),%rdi
 1f7:	b8 40 00 00 00       	mov    $0x40,%eax
 1fc:	48 85 ff             	test   %rdi,%rdi
 1ff:	74 4b                	je     24c <non_inline_cpumask_any_but+0x5c>
 201:	f3 48 0f bc d7       	tzcnt  %rdi,%rdx
 206:	89 d0                	mov    %edx,%eax
 208:	39 d6                	cmp    %edx,%esi
 20a:	75 40                	jne    24c <non_inline_cpumask_any_but+0x5c>
 20c:	83 fa 3f             	cmp    $0x3f,%edx
 20f:	77 3b                	ja     24c <non_inline_cpumask_any_but+0x5c>
 211:	41 b8 01 00 00 00    	mov    $0x1,%r8d
 217:	83 c0 01             	add    $0x1,%eax
 21a:	83 f8 40             	cmp    $0x40,%eax
 21d:	74 2d                	je     24c <non_inline_cpumask_any_but+0x5c>
 21f:	89 c1                	mov    %eax,%ecx
 221:	4c 89 c2             	mov    %r8,%rdx
 224:	48 d3 e2             	shl    %cl,%rdx
 227:	48 89 d0             	mov    %rdx,%rax
 22a:	48 f7 d8             	neg    %rax
 22d:	48 21 f8             	and    %rdi,%rax
 230:	74 15                	je     247 <non_inline_cpumask_any_but+0x57>
 232:	f3 48 0f bc d0       	tzcnt  %rax,%rdx
 237:	89 d0                	mov    %edx,%eax
 239:	39 d6                	cmp    %edx,%esi
 23b:	75 0f                	jne    24c <non_inline_cpumask_any_but+0x5c>
 23d:	83 fa 3f             	cmp    $0x3f,%edx
 240:	76 d5                	jbe    217 <non_inline_cpumask_any_but+0x27>
 242:	e9 00 00 00 00       	jmp    247 <non_inline_cpumask_any_but+0x57>
 247:	b8 40 00 00 00       	mov    $0x40,%eax
 24c:	e9 00 00 00 00       	jmp    251 <non_inline_cpumask_any_but+0x61>

* After the patch:

00000000000001f0 <non_inline_cpumask_any_but>:
 1f0:	f3 0f 1e fa          	endbr64 
 1f4:	48 8b 17             	mov    (%rdi),%rdx
 1f7:	48 85 d2             	test   %rdx,%rdx
 1fa:	74 34                	je     230 <non_inline_cpumask_any_but+0x40>
 1fc:	f3 48 0f bc ca       	tzcnt  %rdx,%rcx
 201:	89 c8                	mov    %ecx,%eax
 203:	39 ce                	cmp    %ecx,%esi
 205:	75 2e                	jne    235 <non_inline_cpumask_any_but+0x45>
 207:	83 c1 01             	add    $0x1,%ecx
 20a:	83 f9 3f             	cmp    $0x3f,%ecx
 20d:	77 21                	ja     230 <non_inline_cpumask_any_but+0x40>
 20f:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
 216:	48 d3 e0             	shl    %cl,%rax
 219:	48 89 c1             	mov    %rax,%rcx
 21c:	b8 40 00 00 00       	mov    $0x40,%eax
 221:	48 21 d1             	and    %rdx,%rcx
 224:	74 0f                	je     235 <non_inline_cpumask_any_but+0x45>
 226:	f3 48 0f bc c1       	tzcnt  %rcx,%rax
 22b:	e9 00 00 00 00       	jmp    230 <non_inline_cpumask_any_but+0x40>
 230:	b8 40 00 00 00       	mov    $0x40,%eax
 235:	e9 00 00 00 00       	jmp    23a <non_inline_cpumask_any_but+0x4a>

> > > 
> > > As a result, the size of the bzImage built with x86 defconfig is
> > > reduced by 4096 bytes:
> > > 
> > > * Before:
> > > $ size arch/x86/boot/bzImage
> > >    text    data     bss     dec     hex filename
> > > 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> > > 
> > > * After:
> > > $ size arch/x86/boot/bzImage
> > >    text    data     bss     dec     hex filename
> > > 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage
> 
> Comparing zipped images tells little about code generation. Please use
> scripts/bloat-o-meter.
> 
$ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
Function                                     old     new   delta
non_inline_cpumask_any_but                    97      74     -23
Total: Before=522, After=499, chg -4.41%

> > > 
> > > Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > ---
> > > Not sure how to measure the efficiency difference, but I guess this
> > > patch might be slightly more efficient or nearly the same as before. If
> > > you have any good ideas for measuring efficiency, please let me know!
> 
> Check lib/find_bit_benchmark.c
> 
> > > 
> > >  include/linux/cpumask.h | 8 ++++----
> > >  1 file changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 9278a50d514f..b769fcdbaa10 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -404,10 +404,10 @@ unsigned int cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> > >  	unsigned int i;
> > >  
> > >  	cpumask_check(cpu);
> > > -	for_each_cpu(i, mask)
> > > -		if (i != cpu)
> > > -			break;
> > > -	return i;
> > > +	i = find_first_bit(cpumask_bits(mask), small_cpumask_bits);
> > 
> > Hi Kuan-Wei,
> > 
> > How about using cpumask_first(mask) here to keep better consistency?
> 
> I would do it the other way: introduce find_first_but_bit(), and then
> make cpumask_any_but() a wrapper around it. Doing this you'll be able
> to leverage find_bit_benchmark infrastructure to measure performance
> difference, if any.
>  
I'll try to conduct this experiment.

Regards,
Kuan-Wei

> > > +	if (i != cpu)
> > > +		return i;
> > Wouldn't it benefit abit to check "i >= nr_cpu_ids" prior to
> > find_next_bit() ?
> 
> Yes it would.
> 
> Thanks,
> Yury
> 
> > if "i >= nr_cpu_ids" holds it would be a waste to
> > perform find_next_bit().
> > 
> > > +	return find_next_bit(cpumask_bits(mask), small_cpumask_bits, i + 1);
> > >  }
> > >  
> > 
> > Regards,
> > I Hsin
> > 
> > >  /**
> > > -- 
> > > 2.34.1
> > >

Re: [PATCH] cpumask: Optimize cpumask_any_but()

Posted by Yury Norov 1 year ago

On Sat, Jan 18, 2025 at 03:32:29PM +0800, Kuan-Wei Chiu wrote:
> Hi Yury,
> 
> On Fri, Jan 17, 2025 at 11:32:54AM -0500, Yury Norov wrote:
> > On Fri, Jan 17, 2025 at 10:59:31PM +0800, I Hsin Cheng wrote:
> > > On Fri, Jan 17, 2025 at 10:26:58PM +0800, Kuan-Wei Chiu wrote:
> > > > The cpumask_any_but() function can avoid using a loop to determine the
> > > > CPU index to return. If the first set bit in the cpumask is not equal
> > > > to the specified CPU, we can directly return the index of the first set
> > > > bit. Otherwise, we return the next set bit's index.
> > > > 
> > > > This optimization replaces the loop with a single if statement,
> > > > allowing the compiler to generate more concise and efficient code.
> > 
> > I thought compilers are smart enough to unroll loop in this case. Can
> > you show disassembled code before and after?
> > 
> Since cpumask_any_but() is an inline function, I added the following to
> lib/cpumask.c for convenience:
> 
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
> unsigned int non_inline_cpumask_any_but(const struct cpumask *mask, unsigned int cpu)
> {
> 	return cpumask_any_but(mask, cpu);
> }
> 
> I used objdump -d ./lib/cpumask.o to compare the differences.
> 
> * Before the patch:
> 
> 00000000000001f0 <non_inline_cpumask_any_but>:
>  1f0:	f3 0f 1e fa          	endbr64 
>  1f4:	48 8b 3f             	mov    (%rdi),%rdi
>  1f7:	b8 40 00 00 00       	mov    $0x40,%eax
>  1fc:	48 85 ff             	test   %rdi,%rdi
>  1ff:	74 4b                	je     24c <non_inline_cpumask_any_but+0x5c>
>  201:	f3 48 0f bc d7       	tzcnt  %rdi,%rdx
>  206:	89 d0                	mov    %edx,%eax
>  208:	39 d6                	cmp    %edx,%esi
>  20a:	75 40                	jne    24c <non_inline_cpumask_any_but+0x5c>
>  20c:	83 fa 3f             	cmp    $0x3f,%edx
>  20f:	77 3b                	ja     24c <non_inline_cpumask_any_but+0x5c>
>  211:	41 b8 01 00 00 00    	mov    $0x1,%r8d
>  217:	83 c0 01             	add    $0x1,%eax
>  21a:	83 f8 40             	cmp    $0x40,%eax
>  21d:	74 2d                	je     24c <non_inline_cpumask_any_but+0x5c>
>  21f:	89 c1                	mov    %eax,%ecx
>  221:	4c 89 c2             	mov    %r8,%rdx
>  224:	48 d3 e2             	shl    %cl,%rdx
>  227:	48 89 d0             	mov    %rdx,%rax
>  22a:	48 f7 d8             	neg    %rax
>  22d:	48 21 f8             	and    %rdi,%rax
>  230:	74 15                	je     247 <non_inline_cpumask_any_but+0x57>
>  232:	f3 48 0f bc d0       	tzcnt  %rax,%rdx
>  237:	89 d0                	mov    %edx,%eax
>  239:	39 d6                	cmp    %edx,%esi
>  23b:	75 0f                	jne    24c <non_inline_cpumask_any_but+0x5c>
>  23d:	83 fa 3f             	cmp    $0x3f,%edx
>  240:	76 d5                	jbe    217 <non_inline_cpumask_any_but+0x27>
>  242:	e9 00 00 00 00       	jmp    247 <non_inline_cpumask_any_but+0x57>
>  247:	b8 40 00 00 00       	mov    $0x40,%eax
>  24c:	e9 00 00 00 00       	jmp    251 <non_inline_cpumask_any_but+0x61>
> 
> * After the patch:
> 
> 00000000000001f0 <non_inline_cpumask_any_but>:
>  1f0:	f3 0f 1e fa          	endbr64 
>  1f4:	48 8b 17             	mov    (%rdi),%rdx
>  1f7:	48 85 d2             	test   %rdx,%rdx
>  1fa:	74 34                	je     230 <non_inline_cpumask_any_but+0x40>
>  1fc:	f3 48 0f bc ca       	tzcnt  %rdx,%rcx
>  201:	89 c8                	mov    %ecx,%eax
>  203:	39 ce                	cmp    %ecx,%esi
>  205:	75 2e                	jne    235 <non_inline_cpumask_any_but+0x45>
>  207:	83 c1 01             	add    $0x1,%ecx
>  20a:	83 f9 3f             	cmp    $0x3f,%ecx
>  20d:	77 21                	ja     230 <non_inline_cpumask_any_but+0x40>
>  20f:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
>  216:	48 d3 e0             	shl    %cl,%rax
>  219:	48 89 c1             	mov    %rax,%rcx
>  21c:	b8 40 00 00 00       	mov    $0x40,%eax
>  221:	48 21 d1             	and    %rdx,%rcx
>  224:	74 0f                	je     235 <non_inline_cpumask_any_but+0x45>
>  226:	f3 48 0f bc c1       	tzcnt  %rcx,%rax
>  22b:	e9 00 00 00 00       	jmp    230 <non_inline_cpumask_any_but+0x40>
>  230:	b8 40 00 00 00       	mov    $0x40,%eax
>  235:	e9 00 00 00 00       	jmp    23a <non_inline_cpumask_any_but+0x4a>
> 
> > > > 
> > > > As a result, the size of the bzImage built with x86 defconfig is
> > > > reduced by 4096 bytes:
> > > > 
> > > > * Before:
> > > > $ size arch/x86/boot/bzImage
> > > >    text    data     bss     dec     hex filename
> > > > 13537280           1024       0 13538304         ce9400 arch/x86/boot/bzImage
> > > > 
> > > > * After:
> > > > $ size arch/x86/boot/bzImage
> > > >    text    data     bss     dec     hex filename
> > > > 13533184           1024       0 13534208         ce8400 arch/x86/boot/bzImage
> > 
> > Comparing zipped images tells little about code generation. Please use
> > scripts/bloat-o-meter.
> > 
> $ ./scripts/bloat-o-meter ./old_cpumask.o ./new_cpumask.o
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-23 (-23)
> Function                                     old     new   delta
> non_inline_cpumask_any_but                    97      74     -23
> Total: Before=522, After=499, chg -4.41%

No need to introduce a wrapper. You need to build allyesconfig (or
defconfig) before and after your patch, and then run bloat-o-meter
against old and new vmlinux.

And specifically for cpumasks, can you please run this experiment
with NR_CPUS == 32 and NR_CPUS == 4096, for example. That way you
will test the change against small_cpumask_bits optimization.

Thanks,
Yury