Add tracepoints to enqueue_task() and dequeue_task(). They are useful for
implementing an RV monitor that validates RT scheduling.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
---
v2: Move the tracepoints to cover all task enqueue/dequeue, not just RT
---
 include/trace/events/sched.h | 13 +++++++++++++
 kernel/sched/core.c          |  8 +++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c08893bde255..ec38928e61e7 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -898,6 +898,19 @@ DECLARE_TRACE(sched_set_need_resched,
 	TP_PROTO(struct task_struct *tsk, int cpu, int tif),
 	TP_ARGS(tsk, cpu, tif));
 
+/*
+ * The two trace points below may not work as expected for fair tasks due
+ * to delayed dequeue. See:
+ * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
+ */
+DECLARE_TRACE(enqueue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
+DECLARE_TRACE(dequeue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b485e0639616..553c08a63395 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2077,6 +2077,8 @@ unsigned long get_wchan(struct task_struct *p)
 
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_enqueue_task_tp(rq->cpu, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	 * and mark the task ->sched_delayed.
 	 */
 	uclamp_rq_dec(rq, p);
-	return p->sched_class->dequeue_task(rq, p, flags);
+	if (p->sched_class->dequeue_task(rq, p, flags)) {
+		trace_dequeue_task_tp(rq->cpu, p);
+		return true;
+	}
+	return false;
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
--
2.39.5
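
To illustrate the use case the changelog mentions, here is a minimal sketch of
an in-kernel consumer (such as an RV monitor) attaching probes to the two new
tracepoints. It assumes the register/unregister helpers generated by
DECLARE_TRACE() follow the trace_enqueue_task_tp()/trace_dequeue_task_tp()
naming visible in the patch; the probe and function names below are made up
for the example and are not part of this series.

#include <linux/kernel.h>
#include <linux/sched.h>
#include <trace/events/sched.h>

/* Probe signature: a data pointer followed by the TP_PROTO arguments. */
static void probe_enqueue_task(void *data, int cpu, struct task_struct *task)
{
	pr_debug("enqueue: cpu=%d pid=%d comm=%s\n", cpu, task->pid, task->comm);
}

static void probe_dequeue_task(void *data, int cpu, struct task_struct *task)
{
	pr_debug("dequeue: cpu=%d pid=%d comm=%s\n", cpu, task->pid, task->comm);
}

static int example_monitor_enable(void)
{
	int ret;

	ret = register_trace_enqueue_task_tp(probe_enqueue_task, NULL);
	if (ret)
		return ret;

	ret = register_trace_dequeue_task_tp(probe_dequeue_task, NULL);
	if (ret)
		unregister_trace_enqueue_task_tp(probe_enqueue_task, NULL);

	return ret;
}

static void example_monitor_disable(void)
{
	unregister_trace_dequeue_task_tp(probe_dequeue_task, NULL);
	unregister_trace_enqueue_task_tp(probe_enqueue_task, NULL);
}

An actual RV monitor would feed these events into its state machine (for
example through the rv_attach_trace_probe() helper used by existing monitors)
rather than printing them.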
On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:

> +/*
> + * The two trace points below may not work as expected for fair tasks due
> + * to delayed dequeue. See:
> + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> + */

> +DECLARE_TRACE(dequeue_task,
> +	TP_PROTO(int cpu, struct task_struct *task),
> +	TP_ARGS(cpu, task));
> +

> @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >  	 * and mark the task ->sched_delayed.
> >  	 */
> >  	uclamp_rq_dec(rq, p);
> > -	return p->sched_class->dequeue_task(rq, p, flags);
> > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > +		trace_dequeue_task_tp(rq->cpu, p);
> > +		return true;
> > +	}
> > +	return false;
> >  }

Hurmpff.. that's not very nice.

How about something like:

dequeue_task():
	...
	ret = p->sched_class->dequeue_task(rq, p, flags);
	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
		__trace_dequeue_task_tp(rq->cpu, p);
	return ret;


__block_task():
	trace_dequeue_task_tp(rq->cpu, p);
	...


Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
will eventually cause __block_task() to be called, either directly, or
delayed.
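
For readers less familiar with delayed dequeue, the following is one possible
reading of the pseudocode above, written as plain C. The helper name is
hypothetical and the enabled/underscore helpers simply follow the names used
in the pseudocode; this is a sketch of the suggestion, not the follow-up
patch itself. The idea is that non-sleep dequeues are reported immediately,
while sleep dequeues, the only ones allowed to "fail" and leave the task
->sched_delayed, are reported from __block_task() once the task really leaves
the runqueue.

/* Hypothetical helper; names follow the pseudocode above, not a merged patch. */
static inline void trace_dequeue_unless_sleep(struct rq *rq,
					      struct task_struct *p, int flags)
{
	/*
	 * A DEQUEUE_SLEEP dequeue may "fail" (delayed dequeue): the task stays
	 * queued and is marked ->sched_delayed. Those dequeues are reported
	 * later, from __block_task(), when the task actually leaves the
	 * runqueue.
	 */
	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
		__trace_dequeue_task_tp(rq->cpu, p);
}

dequeue_task() would call this right after p->sched_class->dequeue_task(), and
__block_task() would call trace_dequeue_task_tp(rq->cpu, p) unconditionally,
exactly as sketched above.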
On Fri, Aug 15, 2025 at 03:40:16PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> 
> > +/*
> > + * The two trace points below may not work as expected for fair tasks due
> > + * to delayed dequeue. See:
> > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > + */
> 
> > +DECLARE_TRACE(dequeue_task,
> > +	TP_PROTO(int cpu, struct task_struct *task),
> > +	TP_ARGS(cpu, task));
> > +
> 
> > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >  	 * and mark the task ->sched_delayed.
> >  	 */
> >  	uclamp_rq_dec(rq, p);
> > -	return p->sched_class->dequeue_task(rq, p, flags);
> > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > +		trace_dequeue_task_tp(rq->cpu, p);
> > +		return true;
> > +	}
> > +	return false;
> >  }
> 
> Hurmpff.. that's not very nice.
> 
> How about something like:
> 
> dequeue_task():
> 	...
> 	ret = p->sched_class->dequeue_task(rq, p, flags);
> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> 		__trace_dequeue_task_tp(rq->cpu, p);
> 	return ret;
> 
> 
> __block_task():
> 	trace_dequeue_task_tp(rq->cpu, p);
> 	...
> 
> 
> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> will eventually cause __block_task() to be called, either directly, or
> delayed.

Thanks for the suggestion, this makes sense.

From my understanding, it makes the tracepoints work correctly for fair
tasks too, so I will get rid of the comment.

Nam
On Tue, Aug 19, 2025 at 09:49:20AM +0200, Nam Cao wrote:
> On Fri, Aug 15, 2025 at 03:40:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> > 
> > > +/*
> > > + * The two trace points below may not work as expected for fair tasks due
> > > + * to delayed dequeue. See:
> > > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > > + */
> > 
> > > +DECLARE_TRACE(dequeue_task,
> > > +	TP_PROTO(int cpu, struct task_struct *task),
> > > +	TP_ARGS(cpu, task));
> > > +
> > 
> > > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > >  	 * and mark the task ->sched_delayed.
> > >  	 */
> > >  	uclamp_rq_dec(rq, p);
> > > -	return p->sched_class->dequeue_task(rq, p, flags);
> > > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > > +		trace_dequeue_task_tp(rq->cpu, p);
> > > +		return true;
> > > +	}
> > > +	return false;
> > >  }
> > 
> > Hurmpff.. that's not very nice.
> > 
> > How about something like:
> > 
> > dequeue_task():
> > 	...
> > 	ret = p->sched_class->dequeue_task(rq, p, flags);
> > 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> > 		__trace_dequeue_task_tp(rq->cpu, p);
> > 	return ret;
> > 
> > 
> > __block_task():
> > 	trace_dequeue_task_tp(rq->cpu, p);
> > 	...
> > 
> > 
> > Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> > will eventually cause __block_task() to be called, either directly, or
> > delayed.
> 
> Thanks for the suggestion, this makes sense.
> 
> From my understanding, it makes the tracepoints work correctly for fair
> tasks too, so I will get rid of the comment.

Just so indeed :-)
On Fri, Aug 15, 2025 at 03:40:17PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> 
> > +/*
> > + * The two trace points below may not work as expected for fair tasks due
> > + * to delayed dequeue. See:
> > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > + */
> 
> > +DECLARE_TRACE(dequeue_task,
> > +	TP_PROTO(int cpu, struct task_struct *task),
> > +	TP_ARGS(cpu, task));
> > +
> 
> > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >  	 * and mark the task ->sched_delayed.
> >  	 */
> >  	uclamp_rq_dec(rq, p);
> > -	return p->sched_class->dequeue_task(rq, p, flags);
> > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > +		trace_dequeue_task_tp(rq->cpu, p);
> > +		return true;
> > +	}
> > +	return false;
> >  }
> 
> Hurmpff.. that's not very nice.
> 
> How about something like:
> 
> dequeue_task():
> 	...
> 	ret = p->sched_class->dequeue_task(rq, p, flags);
> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> 		__trace_dequeue_task_tp(rq->cpu, p);
> 	return ret;
> 
> 
> __block_task():
> 	trace_dequeue_task_tp(rq->cpu, p);
> 	...
> 
> 
> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> will eventually cause __block_task() to be called, either directly, or
> delayed.

If you extend the tracepoint with the sleep state, you can probably
remove the nr_running tracepoints. Esp. once we get this new throttle
stuff sorted.
On Fri, Aug 15, 2025 at 03:52:12PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 15, 2025 at 03:40:17PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 06, 2025 at 10:01:20AM +0200, Nam Cao wrote:
> > 
> > > +/*
> > > + * The two trace points below may not work as expected for fair tasks due
> > > + * to delayed dequeue. See:
> > > + * https://lore.kernel.org/lkml/179674c6-f82a-4718-ace2-67b5e672fdee@amd.com/
> > > + */
> > 
> > > +DECLARE_TRACE(dequeue_task,
> > > +	TP_PROTO(int cpu, struct task_struct *task),
> > > +	TP_ARGS(cpu, task));
> > > +
> > 
> > > @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > >  	 * and mark the task ->sched_delayed.
> > >  	 */
> > >  	uclamp_rq_dec(rq, p);
> > > -	return p->sched_class->dequeue_task(rq, p, flags);
> > > +	if (p->sched_class->dequeue_task(rq, p, flags)) {
> > > +		trace_dequeue_task_tp(rq->cpu, p);
> > > +		return true;
> > > +	}
> > > +	return false;
> > >  }
> > 
> > Hurmpff.. that's not very nice.
> > 
> > How about something like:
> > 
> > dequeue_task():
> > 	...
> > 	ret = p->sched_class->dequeue_task(rq, p, flags);
> > 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> > 		__trace_dequeue_task_tp(rq->cpu, p);
> > 	return ret;
> > 
> > 
> > __block_task():
> > 	trace_dequeue_task_tp(rq->cpu, p);
> > 	...
> > 
> > 
> > Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
> > will eventually cause __block_task() to be called, either directly, or
> > delayed.
> 
> If you extend the tracepoint with the sleep state, you can probably
> remove the nr_running tracepoints. Esp. once we get this new throttle
> stuff sorted.

Sorry, I'm a bit out of depth here. Can you elaborate?

By "sleep state" do you mean (flags & DEQUEUE_SLEEP)? The nr_running
tracepoints are not hit if the task is throttled, while these new
tracepoints are hit. How does the sleep state help with this difference?

Also +Cc Phil Auld <pauld@redhat.com>, who seems to care about the
nr_running tracepoints.

Nam
Hello Nam,

On 8/21/2025 12:35 PM, Nam Cao wrote:
>>> How about something like:
>>>
>>> dequeue_task():
>>> 	...
>>> 	ret = p->sched_class->dequeue_task(rq, p, flags);
>>> 	if (trace_dequeue_task_tp_enabled() && !(flags & DEQUEUE_SLEEP))
>>> 		__trace_dequeue_task_tp(rq->cpu, p);
>>> 	return ret;
>>>
>>>
>>> __block_task():
>>> 	trace_dequeue_task_tp(rq->cpu, p);
>>> 	...
>>>
>>>
>>> Specifically, only DEQUEUE_SLEEP is allowed to fail, and DEQUEUE_SLEEP
>>> will eventually cause __block_task() to be called, either directly, or
>>> delayed.
>>
>> If you extend the tracepoint with the sleep state, you can probably
>> remove the nr_running tracepoints. Esp. once we get this new throttle
>> stuff sorted.
> 
> Sorry, I'm a bit out of depth here. Can you elaborate?
> 
> By "sleep state" do you mean (flags & DEQUEUE_SLEEP)? The nr_running
> tracepoints are not hit if the task is throttled, while these new
> tracepoints are hit. How does the sleep state help with this difference?

Once we have per-task throttling, being discussed in
https://lore.kernel.org/lkml/20250715071658.267-1-ziqianlu@bytedance.com/
throttled tasks will do a

	dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);

and remove themselves from the runqueue, but they won't hit block_task().

To preserve current throttle behavior, I don't think per-task throttle
should call dequeue_task() directly since it does a bunch more stuff with
core-sched dequeue, psi, uclamp, etc. Or maybe it is fine to do that now
with per-task throttling? Peter, what do you think?

If we don't want to do all that stuff in the throttle path, adding to
Peter's suggestion, perhaps we can have a wrapper like:

	int __dequeue_task(rq, p, flags)

		int ret = p->sched_class->dequeue_task(rq, p, flags);

		if (trace_dequeue_task_tp_enabled() &&
		    !((flags & (DEQUEUE_SLEEP | DEQUEUE_THROTTLE)) == DEQUEUE_SLEEP))
			__trace_dequeue_task_tp(rq->cpu, p);

		return ret;

and then per-task throttle can just call __dequeue_task() instead.

I'll let Peter chime in with his thoughts.

> 
> Also +Cc Phil Auld <pauld@redhat.com>, who seems to care about the
> nr_running tracepoints.
> 
> Nam

-- 
Thanks and Regards,
Prateek
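
As a side note on the flag check in the proposed __dequeue_task() wrapper,
here is a small stand-alone sketch of the predicate. The flag values are
illustrative only, and DEQUEUE_THROTTLE comes from the per-task throttling
series linked above, not mainline: the tracepoint is suppressed only for a
plain sleep dequeue, while normal dequeues and throttle dequeues are reported
immediately.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative values; DEQUEUE_THROTTLE does not exist in mainline yet. */
#define DEQUEUE_SLEEP		0x01
#define DEQUEUE_THROTTLE	0x800

/* Mirrors the condition guarding __trace_dequeue_task_tp() in the wrapper. */
static bool tracepoint_fires(int flags)
{
	return !((flags & (DEQUEUE_SLEEP | DEQUEUE_THROTTLE)) == DEQUEUE_SLEEP);
}

int main(void)
{
	printf("plain dequeue    -> %d\n", tracepoint_fires(0));                                /* 1 */
	printf("sleep            -> %d\n", tracepoint_fires(DEQUEUE_SLEEP));                    /* 0 */
	printf("sleep | throttle -> %d\n", tracepoint_fires(DEQUEUE_SLEEP | DEQUEUE_THROTTLE)); /* 1 */
	return 0;
}

The suppressed sleep-only case is still reported eventually, just later, from
__block_task(), per the earlier suggestion in this thread.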