[PATCH v2 4/5] sched: Add task enqueue/dequeue trace points

Nam Cao posted 5 patches 2 months ago
There is a newer version of this series
Posted by Nam Cao 2 months ago
Add tracepoints to enqueue_task() and dequeue_task(). They are useful for
implementing an RV monitor that validates RT scheduling.

Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
---
v2: Move the tracepoints to cover all task enqueue/dequeue, not just RT
---
 include/trace/events/sched.h | 13 +++++++++++++
 kernel/sched/core.c          |  8 +++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c08893bde255..ec38928e61e7 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -898,6 +898,19 @@ DECLARE_TRACE(sched_set_need_resched,
 	TP_PROTO(struct task_struct *tsk, int cpu, int tif),
 	TP_ARGS(tsk, cpu, tif));
 
+/*
+ * The two trace points below may not work as expected for fair tasks due
+ * to delayed dequeue. See:
+ * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
+ */
+DECLARE_TRACE(enqueue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
+DECLARE_TRACE(dequeue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b485e0639616..553c08a63395 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2077,6 +2077,8 @@ unsigned long get_wchan(struct task_struct *p)
 
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_enqueue_task_tp(rq->cpu, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	 * and mark the task ->sched_delayed.
 	 */
 	uclamp_rq_dec(rq, p);
-	return p->sched_class->dequeue_task(rq, p, flags);
+	if (p->sched_class->dequeue_task(rq, p, flags)) {
+		trace_dequeue_task_tp(rq->cpu, p);
+		return true;
+	}
+	return false;
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
-- 
2.39.5
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:

> +/*
> + * The two trace points below may not work as expected for fair tasks due
> + * to delayed dequeue. See:
> + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> + */

> +DECLARE_TRACE(dequeue_task,
> +	TP_PROTO(int cpu, struct task_struct *task),
> +	TP_ARGS(cpu, task));
> +

> @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  	 * and mark the task ->sched_delayed.
>  	 */
>  	uclamp_rq_dec(rq, p);
> -	return p->sched_class->dequeue_task(rq, p, flags);
> +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> +		trace_dequeue_task_tp(rq->cpu, p);
> +		return true;
> +	}
> +	return false;
>  }

Hurmpff.. that's not very nice.

How about something like:

dequeue_task():
	...
	ret = p->sched_class->dequeue_task(rq, p, flags);
	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
		__trace_dequeue_task_tp(rq->cpu, p);
	return ret;


__block_task():
	trace_dequeue_task_tp(rq->cpu, p);
	...


Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
will eventually cause __block_task() to be called, either directly, or
delayed.
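
The reasoning above can be modelled in plain userspace C. This is only a
sketch, not kernel code: struct task, delay_eligible, class_dequeue() and
the event counter are hypothetical stand-ins, and only the value of
DEQUEUE_SLEEP (0x01) is taken from the kernel. Under those assumptions it
shows that tracing non-sleep dequeues in dequeue_task() and sleep dequeues
in __block_task() reports each dequeue exactly once, whether the sleep
completes directly or after a delayed dequeue.

```c
#include <assert.h>
#include <stdbool.h>

#define DEQUEUE_SLEEP 0x01	/* same value as in the kernel */

/* Toy task; every field here is a hypothetical stand-in. */
struct task {
	bool queued;		/* task is on the (model) runqueue */
	bool sched_delayed;	/* the class deferred an earlier sleep dequeue */
	bool delay_eligible;	/* knob: should the class defer the dequeue? */
	int dequeue_events;	/* how often the dequeue tracepoint "fired" */
};

/* Stand-in for trace_dequeue_task_tp(). */
static void trace_dequeue(struct task *p)
{
	p->dequeue_events++;
}

/* Model of __block_task(): a sleeping task really leaves the runqueue. */
static void block_task(struct task *p)
{
	assert(p->queued);
	trace_dequeue(p);
	p->queued = false;
	p->sched_delayed = false;
}

/* Model of the class callback: only a DEQUEUE_SLEEP request may fail. */
static bool class_dequeue(struct task *p, int flags)
{
	if ((flags & DEQUEUE_SLEEP) && p->delay_eligible && !p->sched_delayed) {
		p->sched_delayed = true;	/* stays queued for now */
		return false;
	}
	if (flags & DEQUEUE_SLEEP)
		block_task(p);		/* direct, or completing a delayed one */
	else
		p->queued = false;
	return true;
}

/* Model of dequeue_task() with the suggested tracepoint placement. */
static bool dequeue_task(struct task *p, int flags)
{
	bool ret = class_dequeue(p, flags);

	/* Sleep dequeues are reported from block_task() instead. */
	if (!(flags & DEQUEUE_SLEEP))
		trace_dequeue(p);
	return ret;
}
```

In this model a failed DEQUEUE_SLEEP leaves dequeue_events at zero, and a
second dequeue_task(p, DEQUEUE_SLEEP) on the same task completes the
delayed dequeue and raises the count to one.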
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by Nam Cao 1 month, 2 weeks ago
On Fri, Aug 15, 2025 at 03:40:16PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> 
> > +/*
> > + * The two trace points below may not work as expected for fair tasks due
> > + * to delayed dequeue. See:
> > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > + */
> 
> > +DECLARE_TRACE(dequeue_task,
> > +	TP_PROTO(int cpu, struct task_struct *task),
> > +	TP_ARGS(cpu, task));
> > +
> 
> > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >  	 * and mark the task ->sched_delayed.
> >  	 */
> >  	uclamp_rq_dec(rq, p);
> > -	return p->sched_class->dequeue_task(rq, p, flags);
> > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > +		trace_dequeue_task_tp(rq->cpu, p);
> > +		return true;
> > +	}
> > +	return false;
> >  }
> 
> Hurmpff.. that's not very nice.
> 
> How about something like:
> 
> dequeue_task():
> 	...
> 	ret = p->sched_class->dequeue_task(rq, p, flags);
> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> 		__trace_dequeue_task_tp(rq->cpu, p);
> 	return ret;
> 
> 
> __block_task():
> 	trace_dequeue_task_tp(rq->cpu, p);
> 	...
> 
> 
> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> will eventually cause __block_task() to be called, either directly, or
> delayed.

Thanks for the suggestion, this makes sense.

From my understanding, it makes the tracepoints work correctly for fair
tasks too, so I will get rid of the comment.

Nam
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Tue, Aug 19, 2025 at 09:49:20AM +0200, Nam Cao wrote:
> On Fri, Aug 15, 2025 at 03:40:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> > 
> > > +/*
> > > + * The two trace points below may not work as expected for fair tasks due
> > > + * to delayed dequeue. See:
> > > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > > + */
> > 
> > > +DECLARE_TRACE(dequeue_task,
> > > +	TP_PROTO(int cpu, struct task_struct *task),
> > > +	TP_ARGS(cpu, task));
> > > +
> > 
> > > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > >  	 * and mark the task ->sched_delayed.
> > >  	 */
> > >  	uclamp_rq_dec(rq, p);
> > > -	return p->sched_class->dequeue_task(rq, p, flags);
> > > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > > +		trace_dequeue_task_tp(rq->cpu, p);
> > > +		return true;
> > > +	}
> > > +	return false;
> > >  }
> > 
> > Hurmpff.. that's not very nice.
> > 
> > How about something like:
> > 
> > dequeue_task():
> > 	...
> > 	ret = p->sched_class->dequeue_task(rq, p, flags);
> > 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> > 		__trace_dequeue_task_tp(rq->cpu, p);
> > 	return ret;
> > 
> > 
> > __block_task():
> > 	trace_dequeue_task_tp(rq->cpu, p);
> > 	...
> > 
> > 
> > Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> > will eventually cause __block_task() to be called, either directly, or
> > delayed.
> 
> Thanks for the suggestion, this makes sense.
> 
> From my understanding, it makes the tracepoints work correctly for fair
> tasks too, so I will get rid of the comment.

Just so indeed :-)
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Fri, Aug 15, 2025 at 03:40:17PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> 
> > +/*
> > + * The two trace points below may not work as expected for fair tasks due
> > + * to delayed dequeue. See:
> > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > + */
> 
> > +DECLARE_TRACE(dequeue_task,
> > +	TP_PROTO(int cpu, struct task_struct *task),
> > +	TP_ARGS(cpu, task));
> > +
> 
> > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >  	 * and mark the task ->sched_delayed.
> >  	 */
> >  	uclamp_rq_dec(rq, p);
> > -	return p->sched_class->dequeue_task(rq, p, flags);
> > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > +		trace_dequeue_task_tp(rq->cpu, p);
> > +		return true;
> > +	}
> > +	return false;
> >  }
> 
> Hurmpff.. that's not very nice.
> 
> How about something like:
> 
> dequeue_task():
> 	...
> 	ret = p->sched_class->dequeue_task(rq, p, flags);
> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> 		__trace_dequeue_task_tp(rq->cpu, p);
> 	return ret;
> 
> 
> __block_task():
> 	trace_dequeue_task_tp(rq->cpu, p);
> 	...
> 
> 
> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> will eventually cause __block_task() to be called, either directly, or
> delayed.

If you extend the tracepoint with the sleep state, you can probably
remove the nr_running tracepoints. Esp. once we get this new throttle
stuff sorted.
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by Nam Cao 1 month, 2 weeks ago
On Fri, Aug 15, 2025 at 03:52:12PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 15, 2025 at 03:40:17PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> > 
> > > +/*
> > > + * The two trace points below may not work as expected for fair tasks due
> > > + * to delayed dequeue. See:
> > > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > > + */
> > 
> > > +DECLARE_TRACE(dequeue_task,
> > > +	TP_PROTO(int cpu, struct task_struct *task),
> > > +	TP_ARGS(cpu, task));
> > > +
> > 
> > > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > >  	 * and mark the task ->sched_delayed.
> > >  	 */
> > >  	uclamp_rq_dec(rq, p);
> > > -	return p->sched_class->dequeue_task(rq, p, flags);
> > > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > > +		trace_dequeue_task_tp(rq->cpu, p);
> > > +		return true;
> > > +	}
> > > +	return false;
> > >  }
> > 
> > Hurmpff.. that's not very nice.
> > 
> > How about something like:
> > 
> > dequeue_task():
> > 	...
> > 	ret = p->sched_class->dequeue_task(rq, p, flags);
> > 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> > 		__trace_dequeue_task_tp(rq->cpu, p);
> > 	return ret;
> > 
> > 
> > __block_task():
> > 	trace_dequeue_task_tp(rq->cpu, p);
> > 	...
> > 
> > 
> > Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> > will eventually cause __block_task() to be called, either directly, or
> > delayed.
> 
> If you extend the tracepoint with the sleep state, you can probably
> remove the nr_running tracepoints. Esp. once we get this new throttle
> stuff sorted.

Sorry, I'm a bit out of my depth here. Can you elaborate?

By "sleep state" do you mean (flags & DEQUEUE_SLEEP)? The nr_running
tracepoints are not hit if the task is throttled, while these new
tracepoints are hit. How does the sleep state help with this difference?

Also +Cc Phil Auld <pauld@redhat.com>, who seems to care about the
nr_running tracepoints.

Nam
Re: [PATCH v2 4/5] sched: Add task enqueue/dequeue trace points
Posted by K Prateek Nayak 1 month, 2 weeks ago
Hello Nam,

On 8/21/2025 12:35 PM, Nam Cao wrote:
>>> How about something like:
>>>
>>> dequeue_task():
>>> 	...
>>> 	ret = p->sched_class->dequeue_task(rq, p, flags);
>>> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
>>> 		__trace_dequeue_task_tp(rq->cpu, p);
>>> 	return ret;
>>>
>>>
>>> __block_task():
>>> 	trace_dequeue_task_tp(rq->cpu, p);
>>> 	...
>>>
>>>
>>> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
>>> will eventually cause __block_task() to be called, either directly, or
>>> delayed.
>>
>> If you extend the tracepoint with the sleep state, you can probably
>> remove the nr_running tracepoints. Esp. once we get this new throttle
>> stuff sorted.
> 
> Sorry, I'm a bit out of my depth here. Can you elaborate?
> 
> By "sleep state" do you mean (flags & DEQUEUE_SLEEP)? The nr_running
> tracepoints are not hit if the task is throttled, while these new
> tracepoints are hit. How does the sleep state help with this difference?

Once we have the per-task throttling being discussed in
https://lore.kernel.org/lkml/20250715071658.267-1-ziqianlu@bytedance.com/,
throttled tasks will do a

    dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);

and remove themselves from the runqueue but they won't hit block_task().

To preserve the current throttle behavior, I don't think per-task throttle
should call dequeue_task() directly, since it does a bunch more stuff
with core-sched dequeue, psi, uclamp, etc. Or maybe it is fine to do
that now with per-task throttling?

Peter, what do you think?

If we don't want to do all that stuff in the throttle path, adding to
Peter's suggestion, perhaps we can have a wrapper like:
    
    int __dequeue_task(rq, p, flags)
        int ret = p->sched_class->dequeue_task(rq, p, flags);
        if (trace_dequeue_task_tp_enabled() &&
            !((flags & (DEQUEUE_SLEEP | DEQUEUE_THROTTLE)) == DEQUEUE_SLEEP))
            __trace_dequeue_task_tp(rq->cpu, p);
       
        return ret;

and then per-task throttle can just call __dequeue_task() instead. I'll
let Peter chime in with his thoughts.
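
The flag test in the wrapper above can be checked in isolation. A minimal
userspace sketch follows, assuming DEQUEUE_THROTTLE is a separate flag bit:
the 0x800 value and the helper name trace_in_wrapper() are hypothetical,
since the flag is only proposed in the series linked above, while
DEQUEUE_SLEEP uses the kernel's 0x01.

```c
#include <assert.h>
#include <stdbool.h>

#define DEQUEUE_SLEEP    0x01	/* same value as in the kernel */
#define DEQUEUE_THROTTLE 0x800	/* hypothetical value, flag is only proposed */

/* The masked comparison below only works if the two flags are distinct bits. */
static_assert((DEQUEUE_SLEEP & DEQUEUE_THROTTLE) == 0,
	      "flags must not share bits");

/*
 * True when the wrapper itself should emit the dequeue tracepoint.
 * Only a plain sleep (SLEEP set, THROTTLE clear) is skipped, because
 * __block_task() is expected to report that case later.
 */
static bool trace_in_wrapper(int flags)
{
	return (flags & (DEQUEUE_SLEEP | DEQUEUE_THROTTLE)) != DEQUEUE_SLEEP;
}
```

With these definitions, a plain DEQUEUE_SLEEP is the only combination of
the two flags for which the wrapper stays silent; a throttle dequeue
(DEQUEUE_SLEEP | DEQUEUE_THROTTLE) is traced immediately, since it never
reaches __block_task().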

> 
> Also +Cc Phil Auld <pauld@redhat.com>, who seems to care about the
> nr_running tracepoints.
> 
> Nam

-- 
Thanks and Regards,
Prateek