[PATCH] sched/stats: TASK_IDLE task bypass the block_starts time

Olice Zou posted 1 patch 3 months, 2 weeks ago
[PATCH] sched/stats: TASK_IDLE task bypass the block_starts time
Posted by Olice Zou 3 months, 2 weeks ago
For a TASK_IDLE task, we should not record block_start: it is
not a real TASK_UNINTERRUPTIBLE task.

It is easy to see this problem on an idle machine, as follows:

bpftrace -e 'tracepoint:sched:sched_stat_blocked {
    if (args->delay > 1000000)
    {
        printf("%s %d\n", args->comm, args->delay);
        print(kstack());
    }
}'

rcu_preempt 3881764
        __update_stats_enqueue_sleeper+604
        __update_stats_enqueue_sleeper+604
        enqueue_entity+1014
        enqueue_task_fair+156
        activate_task+109
        ttwu_do_activate+111
        try_to_wake_up+615
        wake_up_process+25
        process_timeout+22
        call_timer_fn+44
        run_timer_softirq+1100
        handle_softirqs+178
        irq_exit_rcu+113
        sysvec_apic_timer_interrupt+132
        asm_sysvec_apic_timer_interrupt+31
        pv_native_safe_halt+15
        arch_cpu_idle+13
        default_idle_call+48
        do_idle+516
        cpu_startup_entry+49
        start_secondary+280
        secondary_startup_64_no_verify+404

Signed-off-by: Olice Zou <olicezou@tencent.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a85539df75a5..e473e3244dda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1285,7 +1285,7 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
 		if (state & TASK_INTERRUPTIBLE)
 			__schedstat_set(tsk->stats.sleep_start,
 				      rq_clock(rq_of(cfs_rq)));
-		if (state & TASK_UNINTERRUPTIBLE)
+		if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
 			__schedstat_set(tsk->stats.block_start,
 				      rq_clock(rq_of(cfs_rq)));
 	}
-- 
2.25.1
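
To make the effect of the one-line change concrete, here is a minimal user-space sketch of the two conditions. It assumes the state bit values below match include/linux/sched.h in recent kernels (check your own tree), and the helper names sets_block_start_before() and sets_block_start_after() are hypothetical, invented only for this illustration.

```c
/*
 * User-space model of the dequeue-time check, not kernel code.
 * Bit values are assumed to mirror include/linux/sched.h.
 */
#define TASK_INTERRUPTIBLE   0x0001u
#define TASK_UNINTERRUPTIBLE 0x0002u
#define TASK_NOLOAD          0x0400u
#define TASK_IDLE            (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* TASK_IDLE carries the UNINTERRUPTIBLE bit, which is why it is counted. */
_Static_assert((TASK_IDLE & TASK_UNINTERRUPTIBLE) != 0,
	       "TASK_IDLE includes TASK_UNINTERRUPTIBLE");

/* Condition before the patch: any state with the UNINTERRUPTIBLE bit. */
static int sets_block_start_before(unsigned int state)
{
	return (state & TASK_UNINTERRUPTIBLE) != 0;
}

/* Condition after the patch: same test, but exactly TASK_IDLE is skipped. */
static int sets_block_start_after(unsigned int state)
{
	return state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE) != 0;
}
```

With these definitions, TASK_IDLE satisfies the old condition but not the new one, while plain TASK_UNINTERRUPTIBLE satisfies both. Note that the new condition is an exact comparison, so it only skips a state that equals TASK_IDLE with no additional bits set.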
Re: [PATCH] sched/stats: TASK_IDLE task bypass the block_starts time
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Fri, Jun 20, 2025 at 11:14:50AM +0800, Olice Zou wrote:
> For a TASK_IDLE task, we should not record block_start: it is
> not a real TASK_UNINTERRUPTIBLE task.

Why, I mean it is still blocked, right?

> It is easy to see this problem on an idle machine, as follows:
> 
> bpftrace -e 'tracepoint:sched:sched_stat_blocked {
>     if (args->delay > 1000000)
>     {
>         printf("%s %d\n", args->comm, args->delay);
>         print(kstack());
>     }
> }'
> 
> rcu_preempt 3881764
>         __update_stats_enqueue_sleeper+604
>         __update_stats_enqueue_sleeper+604
>         enqueue_entity+1014
>         enqueue_task_fair+156
>         activate_task+109
>         ttwu_do_activate+111
>         try_to_wake_up+615
>         wake_up_process+25
>         process_timeout+22
>         call_timer_fn+44
>         run_timer_softirq+1100
>         handle_softirqs+178
>         irq_exit_rcu+113
>         sysvec_apic_timer_interrupt+132
>         asm_sysvec_apic_timer_interrupt+31
>         pv_native_safe_halt+15
>         arch_cpu_idle+13
>         default_idle_call+48
>         do_idle+516
>         cpu_startup_entry+49
>         start_secondary+280
>         secondary_startup_64_no_verify+404

Not sure what I'm looking at there. What is the problem?

Re: [PATCH] sched/stats: TASK_IDLE task bypass the block_starts time
Posted by zoucao 3 months, 2 weeks ago
On 6/20/25 16:55, Peter Zijlstra wrote:
> On Fri, Jun 20, 2025 at 11:14:50AM +0800, Olice Zou wrote:
>> For a TASK_IDLE task, we should not record block_start: it is
>> not a real TASK_UNINTERRUPTIBLE task.
> Why, I mean it is still blocked, right?
Thank you for your reply.

I found this problem while running a test case with intense lock
contention, with contention among thousands of rwsem/mutex locks.

Those tasks are really blocked. But on an idle machine, many kworker
threads are also reported as blocked, even though the machine is idle.

A TASK_IDLE task is not a blocked task; it is more like a sleeping task,
as follows:


in kernel/workqueue.c

static int worker_thread(void *__worker)
{
	...

sleep:
	/*
	 * pool->lock is held and there's no work to process and no need to
	 * manage, sleep.  Workers are woken up only while holding
	 * pool->lock or from local cpu, so setting the current state
	 * before releasing pool->lock is enough to prevent losing any
	 * event.
	 */
	worker_enter_idle(worker);
	/*
	 * NOTE (not in the kernel source): this sets task->__state to
	 * TASK_IDLE, so the task is measured as blocked, although it is
	 * more like a sleeping task.
	 */
	__set_current_state(TASK_IDLE);
	raw_spin_unlock_irq(&pool->lock);
	schedule();
	goto woke_up;
}


The sched:sched_stat_blocked tracepoint is a good way to measure the
duration of lock contention, because it provides the blocked delta time.

With this patch applied, lock contention can be observed in an easy way:


#!/bin/bpftrace

/* Remember every task whose blocked delay exceeded 1 ms (delay is in ns). */
tracepoint:sched:sched_stat_blocked
{
    if (args->delay > 1000000) {
        @sa[args->pid] = 1;
    }
}

/* When a remembered task is switched back in, print where it was blocked. */
kprobe:finish_task_switch
{
    if (@sa[tid]) {
        print(kstack());
        delete(@sa[tid]);
    }
}

This catches the stack of a task that was blocked on a lock, as follows:

dynamic_offline 8684678

         finish_task_switch+1
         schedule+108
         schedule_timeout+567
         wait_for_completion+149
         __wait_rcu_gp+316
         synchronize_rcu+237
         rcu_sync_enter+92
         percpu_down_write+41     --> this task is really blocked, waiting on a percpu_rwsem.
         cgroup_procs_write_start+111
         __cgroup1_procs_write.constprop.0+91
         cgroup1_procs_write+23
         cgroup_file_write+137
         kernfs_fop_write_iter+304
         vfs_write+618
         ksys_write+107
         __x64_sys_write+30
         x64_sys_call+5679
         do_syscall_64+55
         entry_SYSCALL_64_after_hwframe+12


The tracepoint is also useful for iowait tasks, as long as TASK_IDLE is
excluded.


Or should TASK_IDLE tasks be accounted under the sleep statistics of
sched_statistics instead?
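
As an illustration of that alternative only, here is a minimal user-space sketch in which TASK_IDLE is counted in the sleep bucket rather than the block bucket. The bit values are assumed to match include/linux/sched.h in recent kernels, and enum stat_bucket and classify_dequeue_state() are hypothetical names that do not exist in the kernel.

```c
/*
 * User-space model of the alternative accounting, not kernel code.
 * Bit values are assumed to mirror include/linux/sched.h.
 */
#define TASK_INTERRUPTIBLE   0x0001u
#define TASK_UNINTERRUPTIBLE 0x0002u
#define TASK_NOLOAD          0x0400u
#define TASK_IDLE            (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Hypothetical buckets: which timestamp would be set at dequeue time. */
enum stat_bucket {
	BUCKET_NONE,	/* neither sleep_start nor block_start */
	BUCKET_SLEEP,	/* would set stats.sleep_start */
	BUCKET_BLOCK,	/* would set stats.block_start */
};

/*
 * Simplification: the kernel code uses two independent "if" tests, while
 * this model returns a single bucket and checks the sleep case first.
 */
static enum stat_bucket classify_dequeue_state(unsigned int state)
{
	if ((state & TASK_INTERRUPTIBLE) || state == TASK_IDLE)
		return BUCKET_SLEEP;
	if (state & TASK_UNINTERRUPTIBLE)
		return BUCKET_BLOCK;
	return BUCKET_NONE;
}
```

Under this model an idle kworker would show up through the sleep statistics instead of the blocked ones, while a task in plain TASK_UNINTERRUPTIBLE would still be reported as blocked.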

>> It is easy to see this problem on an idle machine, as follows:
>>
>> bpftrace -e 'tracepoint:sched:sched_stat_blocked {
>>      if (args->delay > 1000000)
>>      {
>>          printf("%s %d\n", args->comm, args->delay);
>>          print(kstack());
>>      }
>> }'
>>
>> rcu_preempt 3881764
>>          __update_stats_enqueue_sleeper+604
>>          __update_stats_enqueue_sleeper+604
>>          enqueue_entity+1014
>>          enqueue_task_fair+156
>>          activate_task+109
>>          ttwu_do_activate+111
>>          try_to_wake_up+615
>>          wake_up_process+25
>>          process_timeout+22
>>          call_timer_fn+44
>>          run_timer_softirq+1100
>>          handle_softirqs+178
>>          irq_exit_rcu+113
>>          sysvec_apic_timer_interrupt+132
>>          asm_sysvec_apic_timer_interrupt+31
>>          pv_native_safe_halt+15
>>          arch_cpu_idle+13
>>          default_idle_call+48
>>          do_idle+516
>>          cpu_startup_entry+49
>>          start_secondary+280
>>          secondary_startup_64_no_verify+404
> Not sure what I'm looking at there. What is the problem?

Sorry, I left out the setup step:

     echo 1 > /proc/sys/kernel/sched_schedstats

The sched_schedstats sysctl switch must be enabled first.
