It was observed that the compiler does not hoist the this_cpu
calculation out of the loop in preemption-disabled sections such as:
for_each_cpu(c, mask) {
if (c == smp_processor_id())
do_something
do_something_else
}
With CONFIG_DEBUG_PREEMPT=y, smp_processor_id() expands to a debug
variant that can emit warnings; that may be one reason the compiler
cannot optimize it. __smp_processor_id() is arch-specific, which may be
another reason.
However, even with CONFIG_DEBUG_PREEMPT=n, the compiler did not hoist it
out of the loop.
find_new_ilb disassembly on powerpc (CONFIG_DEBUG_PREEMPT=n):
c00000000028cc7c: bl c000000000a93c98 <_find_next_and_bit>
c00000000028cc80: nop
c00000000028cc84: lwz r5,0(r29)
c00000000028cc88: extsw r30,r3
c00000000028cc8c: mr r31,r3
c00000000028cc90: mr r26,r3
c00000000028cc94: cmplw r5,r3
c00000000028cc98: mr r3,r30
c00000000028cc9c: ble c00000000028ccf8 <kick_ilb+0x10c>
c00000000028cca0: lhz r9,8(r13)
# This is where smp_processor_id() is fetched, i.e. within the loop body.
c00000000028cca4: cmpw r9,r31
c00000000028cca8: beq c00000000028ccc0 <kick_ilb+0xd4>
c00000000028ccac: bl c0000000002cd938 <idle_cpu+0x8>
c00000000028ccb0: nop
c00000000028ccb4: cmpwi r3,0
c00000000028ccb8: bne c00000000028cd30 <kick_ilb+0x144>
find_new_ilb disassembly on x86 (CONFIG_DEBUG_PREEMPT=n):
ffffffff813588eb: call ffffffff81367b30 <housekeeping_cpumask>
ffffffff813588f0: xor %ecx,%ecx
ffffffff813588f2: mov $0xffffffffffffffff,%rsi
ffffffff813588f9: mov %rax,%r8
ffffffff813588fc: mov %rsi,%rdx
ffffffff813588ff: mov 0x29258ba(%rip),%rax # ffffffff83c7e1c0 <nohz>
ffffffff81358906: and (%r8),%rax
ffffffff81358909: shl %cl,%rdx
ffffffff8135890c: and %rdx,%rax
ffffffff8135890f: je ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff81358911: tzcnt %rax,%rbx
ffffffff81358916: cmp $0x3f,%ebx
ffffffff81358919: ja ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff8135891b: cmp %ebx,%gs:0x28e7712(%rip) # ffffffff83c40034 <cpu_number>
# This is smp_processor_id() in the loop.
ffffffff81358922: mov %ebx,%edi
ffffffff81358924: je ffffffff81358946 <sched_balance_trigger+0x136>
ffffffff81358926: mov %r8,0x8(%rsp)
ffffffff8135892b: mov %ebx,(%rsp)
ffffffff8135892e: call ffffffff81365140 <idle_cpu>
ffffffff81358933: mov $0xffffffffffffffff,%rsi
ffffffff8135893a: mov (%rsp),%edi
ffffffff8135893d: mov 0x8(%rsp),%r8
ffffffff81358942: test %eax,%eax
ffffffff81358944: jne ffffffff813589a4 <sched_balance_trigger+0x194>
ffffffff81358946: lea 0x1(%rbx),%ecx
find_new_ilb disassembly on powerpc with the patched kernel:
c00000000028cc5c: 08 00 4d a3 lhz r26,8(r13)
# smp_processor_id() is fetched only once, before the loop.
...
c00000000028cc94: bl c000000000a93cd8 <_find_next_and_bit>
c00000000028cc98: nop
c00000000028cc9c: lwz r5,0(r29)
c00000000028cca0: extsw r30,r3
c00000000028cca4: mr r31,r3
...
c00000000028cca8: cmpw cr7,r26,r3
c00000000028ccb8: ble c00000000028cd14 <kick_ilb+0x118>
c00000000028ccbc: nop
c00000000028ccc0: beq cr7,c00000000028ccd8 <kick_ilb+0xdc>
c00000000028ccc4: bl c0000000002cd958 <idle_cpu+0x8>
With CONFIG_DEBUG_PREEMPT=y, smp_processor_id() does not print any
warning as long as preemption or interrupts are disabled.
With CONFIG_DEBUG_PREEMPT=n, it does nothing beyond calling
__smp_processor_id().
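For reference, the relevant definitions look roughly like the following
(a paraphrased sketch of include/linux/smp.h, not the exact source of
any particular tree):

```c
/* Paraphrased sketch -- check include/linux/smp.h in your tree. */
#ifdef CONFIG_DEBUG_PREEMPT
  /* Out-of-line call that can warn when used in a preemptible context. */
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
  /* Plain arch-specific per-cpu read, e.g. a %gs-relative load on x86. */
# define smp_processor_id() __smp_processor_id()
#endif
```

In either configuration the expansion is opaque enough that the compiler
does not treat it as loop-invariant.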
So with CONFIG_DEBUG_PREEMPT=y or =n alike, in a preemption-disabled
section it is better to cache the value. It could save a few cycles;
though tiny per call, repeated in a loop it can add up to a small win.
This is done only for hot paths or functions that are called quite often.
It is skipped for init code or conditional hot paths such as tracing/events.
While this was initially sent out[1] along with other scheduler changes,
it made more sense to send it as a separate series after observing a few
more cases falling into the same bucket.
[1]: https://lore.kernel.org/all/20260319065314.343932-1-sshegde@linux.ibm.com/
Shrikanth Hegde (4):
sched/fair: get this cpu once in find_new_ilb
sched/core: get this cpu once in ttwu_queue_cond
smp: get this_cpu once in smp_call_function
timers: Get this_cpu once while clearing idle timer
kernel/sched/core.c | 6 ++++--
kernel/sched/fair.c | 4 ++--
kernel/smp.c | 4 ++--
kernel/time/timer.c | 5 +++--
4 files changed, 11 insertions(+), 8 deletions(-)
--
2.47.3
On Tue, Mar 24, 2026 at 01:06:26AM +0530, Shrikanth Hegde wrote:
> [ ... ]
The changes in all 4 patches are very similar and they are quite small.
IMO they can be clubbed together as a single patch.
Regards,
Mukesh