[PATCH 0/4] Micro optimise to get this_cpu once

Shrikanth Hegde posted 4 patches 1 week, 3 days ago
It was observed that the compiler doesn't hoist the this_cpu
calculation out of the loop in preempt-disabled sections such as:

for_each_cpu(c, mask) {
	if (c == smp_processor_id())
		do_something
	do_something_else
}

With CONFIG_DEBUG_PREEMPT=y, smp_processor_id() can print warnings, so
maybe that's one reason the compiler can't optimize it.
__smp_processor_id() is arch specific, which may be another reason.

Even with CONFIG_DEBUG_PREEMPT=n, the compiler didn't hoist it out of
the loop.

find_new_ilb disassembly on powerpc (CONFIG_DEBUG_PREEMPT=n):
c00000000028cc7c:       bl      c000000000a93c98 <_find_next_and_bit>
c00000000028cc80:       nop
c00000000028cc84:       lwz     r5,0(r29)
c00000000028cc88:       extsw   r30,r3  
c00000000028cc8c:       mr      r31,r3  
c00000000028cc90:       mr      r26,r3  
c00000000028cc94:       cmplw   r5,r3   
c00000000028cc98:       mr      r3,r30  
c00000000028cc9c:       ble     c00000000028ccf8 <kick_ilb+0x10c>
c00000000028cca0:       lhz     r9,8(r13)
     #This is where smp_processor_id() is fetched, i.e. within the loop body.
c00000000028cca4:       cmpw    r9,r31  
c00000000028cca8:       beq     c00000000028ccc0 <kick_ilb+0xd4> 
c00000000028ccac:       bl      c0000000002cd938 <idle_cpu+0x8>
c00000000028ccb0:       nop
c00000000028ccb4:       cmpwi   r3,0    
c00000000028ccb8:       bne     c00000000028cd30 <kick_ilb+0x144>

find_new_ilb disassembly on x86 (CONFIG_DEBUG_PREEMPT=n):
ffffffff813588eb: 	call   ffffffff81367b30 <housekeeping_cpumask>
ffffffff813588f0: 	xor    %ecx,%ecx
ffffffff813588f2: 	mov    $0xffffffffffffffff,%rsi
ffffffff813588f9: 	mov    %rax,%r8
ffffffff813588fc: 	mov    %rsi,%rdx
ffffffff813588ff: 	mov    0x29258ba(%rip),%rax        # ffffffff83c7e1c0 <nohz>
ffffffff81358906: 	and    (%r8),%rax
ffffffff81358909: 	shl    %cl,%rdx
ffffffff8135890c: 	and    %rdx,%rax
ffffffff8135890f: 	je     ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff81358911: 	tzcnt  %rax,%rbx
ffffffff81358916: 	cmp    $0x3f,%ebx
ffffffff81358919: 	ja     ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff8135891b: 	cmp    %ebx,%gs:0x28e7712(%rip)        # ffffffff83c40034 <cpu_number>
    #This is smp_processor_id() in the loop.
ffffffff81358922: 	mov    %ebx,%edi
ffffffff81358924: 	je     ffffffff81358946 <sched_balance_trigger+0x136>
ffffffff81358926: 	mov    %r8,0x8(%rsp)
ffffffff8135892b: 	mov    %ebx,(%rsp)
ffffffff8135892e: 	call   ffffffff81365140 <idle_cpu>
ffffffff81358933: 	mov    $0xffffffffffffffff,%rsi
ffffffff8135893a: 	mov    (%rsp),%edi
ffffffff8135893d: 	mov    0x8(%rsp),%r8
ffffffff81358942: 	test   %eax,%eax
ffffffff81358944: 	jne    ffffffff813589a4 <sched_balance_trigger+0x194>
ffffffff81358946: 	lea    0x1(%rbx),%ecx

find_new_ilb disassembly on powerpc with the patched kernel:
c00000000028cc5c:       08 00 4d a3     lhz     r26,8(r13)
     #It is fetched once, before the loop.
...
c00000000028cc94:     bl      c000000000a93cd8 <_find_next_and_bit>
c00000000028cc98:     nop
c00000000028cc9c:     lwz     r5,0(r29)
c00000000028cca0:     extsw   r30,r3
c00000000028cca4:     mr      r31,r3
...
c00000000028cca8:     cmpw    cr7,r26,r3
c00000000028ccb8:     ble     c00000000028cd14 <kick_ilb+0x118>
c00000000028ccbc:     nop
c00000000028ccc0:     beq     cr7,c00000000028ccd8 <kick_ilb+0xdc>
c00000000028ccc4:     bl      c0000000002cd958 <idle_cpu+0x8>


With CONFIG_DEBUG_PREEMPT=y, if preemption or irqs are disabled, it
does not print any warning.

With CONFIG_DEBUG_PREEMPT=n, it does nothing apart from reading
__smp_processor_id().

So with both CONFIG_DEBUG_PREEMPT=y and =n, it is better to cache the
value in a preemption-disabled section. Each fetch saves only a few
cycles, but repeated in a loop that could add up to a small value.

This is done only for hotpaths or functions which get called quite
often. Init paths and conditional hotpaths such as tracing/events are
skipped.

This was originally sent out[1] along with another scheduler change,
but it made more sense to send it as a separate series after observing
a few more cases falling in the same bucket.
[1]: https://lore.kernel.org/all/20260319065314.343932-1-sshegde@linux.ibm.com/


Shrikanth Hegde (4):
  sched/fair: get this cpu once in find_new_ilb
  sched/core: get this cpu once in ttwu_queue_cond
  smp: get this_cpu once in smp_call_function
  timers: Get this_cpu once while clearing idle timer

 kernel/sched/core.c | 6 ++++--
 kernel/sched/fair.c | 4 ++--
 kernel/smp.c        | 4 ++--
 kernel/time/timer.c | 5 +++--
 4 files changed, 11 insertions(+), 8 deletions(-)

-- 
2.47.3
Re: [PATCH 0/4] Micro optimise to get this_cpu once
Posted by Mukesh Kumar Chaurasiya 1 week, 3 days ago
On Tue, Mar 24, 2026 at 01:06:26AM +0530, Shrikanth Hegde wrote:
> [...]
>
The changes in all 4 patches are very similar and they are quite small.
IMO they can be clubbed together as a single patch.

Regards,
Mukesh