For a TASK_IDLE task, we should not record block_start; it is
not a real TASK_UNINTERRUPTIBLE task.

This problem is easy to reproduce on an idle machine as follows:
bpftrace -e 'tracepoint:sched:sched_stat_blocked { \
        if (args->delay > 1000000) \
        { \
                printf("%s %d\n", args->comm, args->delay); \
                print(kstack()); \
        } \
}'
rcu_preempt 3881764
__update_stats_enqueue_sleeper+604
__update_stats_enqueue_sleeper+604
enqueue_entity+1014
enqueue_task_fair+156
activate_task+109
ttwu_do_activate+111
try_to_wake_up+615
wake_up_process+25
process_timeout+22
call_timer_fn+44
run_timer_softirq+1100
handle_softirqs+178
irq_exit_rcu+113
sysvec_apic_timer_interrupt+132
asm_sysvec_apic_timer_interrupt+31
pv_native_safe_halt+15
arch_cpu_idle+13
default_idle_call+48
do_idle+516
cpu_startup_entry+49
start_secondary+280
secondary_startup_64_no_verify+404
Signed-off-by: Olice Zou <olicezou@tencent.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a85539df75a5..e473e3244dda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1285,7 +1285,7 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
 		if (state & TASK_INTERRUPTIBLE)
 			__schedstat_set(tsk->stats.sleep_start,
 				      rq_clock(rq_of(cfs_rq)));
-		if (state & TASK_UNINTERRUPTIBLE)
+		if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
 			__schedstat_set(tsk->stats.block_start,
 				      rq_clock(rq_of(cfs_rq)));
 	}
--
2.25.1
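
The effect of the changed condition can be sketched in user space. This
is only an illustration, not kernel code: the state values are copied
from a recent include/linux/sched.h and may differ between kernel
versions, and the helper names records_block_start_old() and
records_block_start_new() are hypothetical. TASK_IDLE is defined as
(TASK_UNINTERRUPTIBLE | TASK_NOLOAD), so a plain mask test against
TASK_UNINTERRUPTIBLE also matches TASK_IDLE.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, copied from a recent include/linux/sched.h. */
#define TASK_INTERRUPTIBLE	0x00000001u
#define TASK_UNINTERRUPTIBLE	0x00000002u
#define TASK_NOLOAD		0x00000400u
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* TASK_IDLE carries the TASK_UNINTERRUPTIBLE bit. */
static_assert((TASK_IDLE & TASK_UNINTERRUPTIBLE) != 0,
	      "TASK_IDLE includes TASK_UNINTERRUPTIBLE");

/* Hypothetical helper: the condition before the patch. */
static inline bool records_block_start_old(unsigned int state)
{
	return (state & TASK_UNINTERRUPTIBLE) != 0;
}

/* Hypothetical helper: the condition after the patch. */
static inline bool records_block_start_new(unsigned int state)
{
	return state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE) != 0;
}
```

With these helpers, records_block_start_old(TASK_IDLE) is true while
records_block_start_new(TASK_IDLE) is false; both return true for a
plain TASK_UNINTERRUPTIBLE state.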
On Fri, Jun 20, 2025 at 11:14:50AM +0800, Olice Zou wrote:
> For TASK_IDLE task, we not should record the block_starts, it is
> not real TASK_UNINTERRUPTIBLE task.

Why, I mean it is still blocked, right?

> It is easy to find this problem in a idle machine as followe:
>
> bpftrace -e 'tracepoint:sched:sched_stat_blocked { \
> if (args->delay > 1000000) \
> { \
> printf("%s %d\n", args->comm, args->delay); \
> print(kstack()); \
> } \
> }
>
> rcu_preempt 3881764
> __update_stats_enqueue_sleeper+604
> __update_stats_enqueue_sleeper+604
> enqueue_entity+1014
> enqueue_task_fair+156
> activate_task+109
> ttwu_do_activate+111
> try_to_wake_up+615
> wake_up_process+25
> process_timeout+22
> call_timer_fn+44
> run_timer_softirq+1100
> handle_softirqs+178
> irq_exit_rcu+113
> sysvec_apic_timer_interrupt+132
> asm_sysvec_apic_timer_interrupt+31
> pv_native_safe_halt+15
> arch_cpu_idle+13
> default_idle_call+48
> do_idle+516
> cpu_startup_entry+49
> start_secondary+280
> secondary_startup_64_no_verify+404

Not sure what I'm looking at there. What is the problem?

> Signed-off-by: Olice Zou <olicezou@tencent.com>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a85539df75a5..e473e3244dda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1285,7 +1285,7 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
> if (state & TASK_INTERRUPTIBLE)
> __schedstat_set(tsk->stats.sleep_start,
> rq_clock(rq_of(cfs_rq)));
> - if (state & TASK_UNINTERRUPTIBLE)
> + if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
> __schedstat_set(tsk->stats.block_start,
> rq_clock(rq_of(cfs_rq)));
> }
> --
> 2.25.1
>
On 6/20/25 16:55, Peter Zijlstra wrote:
> On Fri, Jun 20, 2025 at 11:14:50AM +0800, Olice Zou wrote:
>> For TASK_IDLE task, we not should record the block_starts, it is
>> not real TASK_UNINTERRUPTIBLE task.
> Why, I mean it is still blocked, right?

Thank you for your reply.

I found this problem while running a test case for intense lock
contention, with contention among thousands of rwsem/mutex locks. Those
tasks are really blocked. On an idle machine, however, many kworker
threads are also reported as blocked, even though the machine is idle.

A TASK_IDLE task is not a blocked task; it is more like a sleeping
task, as follows.

In kernel/workqueue.c:

static int worker_thread(void *__worker)
{
	...................
sleep:
	/*
	 * pool->lock is held and there's no work to process and no need to
	 * manage, sleep. Workers are woken up only while holding
	 * pool->lock or from local cpu, so setting the current state
	 * before releasing pool->lock is enough to prevent losing any
	 * event.
	 */
	worker_enter_idle(worker);
	__set_current_state(TASK_IDLE);
		---> this sets task->__state to TASK_IDLE, which makes the
		     task count as blocked, but it is more like a sleeping
		     task.
	raw_spin_unlock_irq(&pool->lock);
	schedule();
	goto woke_up;
}

The sched:sched_stat_blocked tracepoint is a good place to measure the
duration of lock contention, because it provides the blocked time
delta. With this patch, lock contention can be observed in an easy way:

#!/bin/bpftrace
#include <linux/sched.h>

tracepoint:sched:sched_stat_blocked
{
	if (args->delay > 1000000) {
		@sa[args->pid] = 1;
	}
}

kprobe:finish_task_switch
{
	$task = (struct task_struct *) arg0;
	if (@sa[tid]) {
		print(kstack());
		delete(@sa[tid]);
	}
}

This catches tasks blocked on locks, as follows:

dynamic_offline 8684678
finish_task_switch+1
schedule+108
schedule_timeout+567
wait_for_completion+149
__wait_rcu_gp+316
synchronize_rcu+237
rcu_sync_enter+92
percpu_down_write+41   --> this is a really blocked task, waiting on a percpu_rwsem.
cgroup_procs_write_start+111
__cgroup1_procs_write.constprop.0+91
cgroup1_procs_write+23
cgroup_file_write+137
kernfs_fop_write_iter+304
vfs_write+618
ksys_write+107
__x64_sys_write+30
x64_sys_call+5679
do_syscall_64+55
entry_SYSCALL_64_after_hwframe+12

It is also useful for iowait tasks, once TASK_IDLE is excluded.

Or should TASK_IDLE tasks be accounted under the sleep statistics of
sched_statistics instead?

>> It is easy to find this problem in a idle machine as followe:
>>
>> bpftrace -e 'tracepoint:sched:sched_stat_blocked { \
>> if (args->delay > 1000000) \
>> { \
>> printf("%s %d\n", args->comm, args->delay); \
>> print(kstack()); \
>> } \
>> }
>>
>> rcu_preempt 3881764
>> __update_stats_enqueue_sleeper+604
>> __update_stats_enqueue_sleeper+604
>> enqueue_entity+1014
>> enqueue_task_fair+156
>> activate_task+109
>> ttwu_do_activate+111
>> try_to_wake_up+615
>> wake_up_process+25
>> process_timeout+22
>> call_timer_fn+44
>> run_timer_softirq+1100
>> handle_softirqs+178
>> irq_exit_rcu+113
>> sysvec_apic_timer_interrupt+132
>> asm_sysvec_apic_timer_interrupt+31
>> pv_native_safe_halt+15
>> arch_cpu_idle+13
>> default_idle_call+48
>> do_idle+516
>> cpu_startup_entry+49
>> start_secondary+280
>> secondary_startup_64_no_verify+404
> Not sure what I'm looking at there. What is the problem?

Sorry, I left out the setup step:

echo 1 > /proc/sys/kernel/sched_schedstats

The sched_schedstats sysctl switch has to be enabled first.

>> Signed-off-by: Olice Zou <olicezou@tencent.com>
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a85539df75a5..e473e3244dda 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1285,7 +1285,7 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
>> if (state & TASK_INTERRUPTIBLE)
>> __schedstat_set(tsk->stats.sleep_start,
>> rq_clock(rq_of(cfs_rq)));
>> - if (state & TASK_UNINTERRUPTIBLE)
>> + if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
>> __schedstat_set(tsk->stats.block_start,
>> rq_clock(rq_of(cfs_rq)));
>> }
>> --
>> 2.25.1
>>
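
The two options raised in the reply (drop TASK_IDLE from the blocked
statistics, or count it as sleep) can be compared with a small
user-space model. This is a sketch under stated assumptions, not kernel
code: the state values are copied from a recent include/linux/sched.h,
the enum and both function names are hypothetical, and the model
returns a single bucket per state, whereas the kernel code uses two
independent if statements.

```c
#include <assert.h>

/* Assumed values, copied from a recent include/linux/sched.h. */
#define TASK_INTERRUPTIBLE	0x00000001u
#define TASK_UNINTERRUPTIBLE	0x00000002u
#define TASK_NOLOAD		0x00000400u
#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Hypothetical: which start timestamp would be set on dequeue. */
enum stat_bucket { STAT_NONE, STAT_SLEEP, STAT_BLOCK };

/* Models the patch as posted: TASK_IDLE gets neither timestamp. */
static inline enum stat_bucket bucket_patched(unsigned int state)
{
	if (state & TASK_INTERRUPTIBLE)
		return STAT_SLEEP;
	if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
		return STAT_BLOCK;
	return STAT_NONE;
}

/* Models the alternative asked about: TASK_IDLE counted as sleep. */
static inline enum stat_bucket bucket_idle_as_sleep(unsigned int state)
{
	if (state == TASK_IDLE || (state & TASK_INTERRUPTIBLE))
		return STAT_SLEEP;
	if (state & TASK_UNINTERRUPTIBLE)
		return STAT_BLOCK;
	return STAT_NONE;
}
```

In this model the two variants differ only for TASK_IDLE: the posted
patch leaves it unaccounted, while the alternative would make idle
kworkers show up in the sleep statistics instead.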