Add trace points to enqueue_task_rt() and dequeue_task_rt(). They are
useful for implementing an RV monitor that validates RT scheduling.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
---
include/trace/events/sched.h | 8 ++++++++
kernel/sched/rt.c | 4 ++++
2 files changed, 12 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c08893bde255..c38f12f7f903 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -898,6 +898,14 @@ DECLARE_TRACE(sched_set_need_resched,
 	TP_PROTO(struct task_struct *tsk, int cpu, int tif),
 	TP_ARGS(tsk, cpu, tif));
 
+DECLARE_TRACE(enqueue_task_rt,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
+DECLARE_TRACE(dequeue_task_rt,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..f4d3f5e7fbec 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1480,6 +1480,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
 
+	trace_enqueue_task_rt_tp(rq->cpu, p);
+
 	if (flags & ENQUEUE_WAKEUP)
 		rt_se->timeout = 0;
 
@@ -1501,6 +1503,8 @@ static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	dequeue_pushable_task(rq, p);
 
+	trace_dequeue_task_rt_tp(rq->cpu, p);
+
 	return true;
 }
--
2.39.5
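
For context, a minimal sketch (not part of the patch) of how an RV monitor
could consume the proposed tracepoints from kernel code; the "rts" monitor
name and handler bodies are hypothetical, and the probe signatures must
match the TP_PROTO above, plus the leading void *data every raw tracepoint
probe receives:

	#include <linux/sched.h>
	#include <rv/instrumentation.h>		/* rv_attach_trace_probe() */
	#include <trace/events/sched.h>

	static void handle_enqueue_rt(void *data, int cpu, struct task_struct *task)
	{
		/* hypothetical: feed "RT task became runnable on cpu" to the monitor */
	}

	static void handle_dequeue_rt(void *data, int cpu, struct task_struct *task)
	{
		/* hypothetical: feed "RT task left the runqueue" to the monitor */
	}

	static int enable_rts(void)
	{
		/* DECLARE_TRACE() generates the _tp-suffixed symbols used here */
		rv_attach_trace_probe("rts", enqueue_task_rt_tp, handle_enqueue_rt);
		rv_attach_trace_probe("rts", dequeue_task_rt_tp, handle_dequeue_rt);
		return 0;
	}

Being bare DECLARE_TRACE() tracepoints, these carry no trace-event payload
and are not exposed to userspace tracefs, which is also noted later in the
thread.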
On Wed, 2025-07-30 at 14:45 +0200, Nam Cao wrote:
> Add trace points to enqueue_task_rt() and dequeue_task_rt(). They are
> useful for implementing an RV monitor that validates RT scheduling.

I get it's much simpler this way, but is it that different to follow
the task's existing tracepoints?

* task going to sleep (switch:prev_state != RUNNING) is dequeued
* task waking up is enqueued
* changing the task's policy (setpolicy and setattr syscalls) should
  enqueue/dequeue as well

This is more thinking out loud, but I'm doing right now something
rather similar with the deadline tasks and this seems reasonable, at
least on paper.

What do you think?

Thanks,
Gabriele
On Wed, Jul 30, 2025 at 03:53:14PM +0200, Gabriele Monaco wrote:
> I get it's much simpler this way, but is it that different to follow
> the task's existing tracepoints?
>
> * task going to sleep (switch:prev_state != RUNNING) is dequeued
> * task waking up is enqueued
> * changing the task's policy (setpolicy and setattr syscalls) should
>   enqueue/dequeue as well
>
> This is more thinking out loud, but I'm doing right now something
> rather similar with the deadline tasks and this seems reasonable, at
> least on paper.
>
> What do you think?

I think more or less the same. The fewer tracepoints, the better. But the
monitor is way more obvious this way.

Let me see how hard it is to use the existing tracepoints...

Nam
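
For reference, deriving the same events from the existing tracepoints, as
suggested above, would look roughly like this. A sketch only, assuming the
current sched_switch/sched_wakeup probe signatures; the handler names are
made up, and the setscheduler paths would still need separate handling:

	#include <linux/sched.h>
	#include <rv/instrumentation.h>
	#include <trace/events/sched.h>

	/* A task switched out in a non-RUNNING state went to sleep: dequeued. */
	static void handle_switch(void *data, bool preempt, struct task_struct *prev,
				  struct task_struct *next, unsigned int prev_state)
	{
		if (prev_state != TASK_RUNNING) {
			/* treat prev as dequeued */
		}
	}

	/* A task waking up is enqueued. */
	static void handle_wakeup(void *data, struct task_struct *p)
	{
		/* treat p as enqueued */
	}

	static int enable_monitor(void)
	{
		rv_attach_trace_probe("rts", sched_switch, handle_switch);
		rv_attach_trace_probe("rts", sched_wakeup, handle_wakeup);
		return 0;
	}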
On Wed, 2025-07-30 at 17:18 +0200, Nam Cao wrote:
> On Wed, Jul 30, 2025 at 03:53:14PM +0200, Gabriele Monaco wrote:
> > I get it's much simpler this way, but is it that different to
> > follow the task's existing tracepoints?
> >
> > [...]
> >
> > What do you think?
>
> I think more or less the same. The fewer tracepoints, the better. But
> the monitor is way more obvious this way.
>
> Let me see how hard it is to use the existing tracepoints...

Well, thinking about it again, these tracepoints might simplify things
considerably when tasks change policy.

Syscalls may fail; for that you could register to sys_exit and check
the return value, but at that point the policy has changed already, so
you cannot tell whether it's a relevant event or not (e.g. same policy).
Also sched_setscheduler_nocheck() would be out of the picture here; not
sure how frequent that is though (and it might not matter if you only
focus on userspace tasks).

If you go down the route of adding tracepoints, why not have other
classes benefit too? I believe calling them from enqueue_task() /
dequeue_task() in sched/core.c would allow you to easily filter by
policy anyway (haven't tested).

Thanks,
Gabriele
On Wed, Jul 30, 2025 at 06:18:45PM +0200, Gabriele Monaco wrote:
> Well, thinking about it again, these tracepoints might simplify things
> considerably when tasks change policy.
>
> [...]
>
> If you go down the route of adding tracepoints, why not have other
> classes benefit too? I believe calling them from enqueue_task() /
> dequeue_task() in sched/core.c would allow you to easily filter by
> policy anyway (haven't tested).

Something like the untested patch below?

Will you have a use case for it too? Then I will try to accommodate your
use case, otherwise I will do just enough for my case.

Nam

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c38f12f7f903..b50668052f99 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -906,6 +906,14 @@ DECLARE_TRACE(dequeue_task_rt,
 	TP_PROTO(int cpu, struct task_struct *task),
 	TP_ARGS(cpu, task));
 
+DECLARE_TRACE(enqueue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
+DECLARE_TRACE(dequeue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b485e0639616..2af90532982a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2077,6 +2077,8 @@ unsigned long get_wchan(struct task_struct *p)
 
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_enqueue_task_tp(rq->cpu, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -2103,6 +2105,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
  */
 inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_dequeue_task_tp(rq->cpu, p);
+
 	if (sched_core_enabled(rq))
 		sched_core_dequeue(rq, p, flags);
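
If the generic tracepoints were placed in sched/core.c as above, a probe
could do the per-class filtering itself. A sketch using the existing
rt_task() helper; the handler name is made up:

	#include <linux/sched/rt.h>	/* rt_task() */

	static void handle_enqueue(void *data, int cpu, struct task_struct *task)
	{
		if (!rt_task(task))
			return;		/* this monitor only cares about RT tasks */

		/* handle "RT task enqueued on cpu" */
	}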
On 7/31/2025 1:05 PM, Nam Cao wrote:
> Something like the untested patch below?
>
> Will you have a use case for it too? Then I will try to accommodate your
> use case, otherwise I will do just enough for my case.
>
> Nam
>
> [...]
>
> @@ -2103,6 +2105,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>   */
>  inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	trace_dequeue_task_tp(rq->cpu, p);

Just thinking out loud, putting this tracepoint here can lead to a
"dequeued -> dequeued" transition for fair tasks when they are in the
delayed dequeue state.

	dequeue_task(p)
	  trace_dequeue_task_tp(p)		# First time
	  dequeue_task_fair(p)
	    p->se.delayed = 1
	...
	<sched_switch>				# p is still delayed
	...
	sched_setscheduler(p)
	  if (prev_class != next_class && p->se.sched_delayed)
	    dequeue_task(p, DEQUEUE_DELAYED);
	      trace_dequeue_task_tp(p)		# Second time

It is not an issue as such, but it might come as a surprise if users are
expecting a behavior like below, which would be the case for !fair tasks
currently (and for all tasks before v6.12):

digraph state_automaton {
	center = true;
	size = "7,11";
	{node [shape = plaintext, style=invis, label=""] "__init_enqueue_dequeue_cycle"};
	{node [shape = ellipse] "enqueued"};
	{node [shape = ellipse] "dequeued"};
	"__init_enqueue_dequeue_cycle" -> "enqueued";
	"__init_enqueue_dequeue_cycle" -> "dequeued";
	"enqueued" [label = "enqueued", color = green3];
	"enqueued" -> "dequeued" [ label = "dequeue_task" ];
	"dequeued" [label = "dequeued", color = red];
	"dequeued" -> "enqueued" [ label = "enqueue_task" ];
	{ rank = min ;
		"__init_enqueue_dequeue_cycle";
		"dequeued";
		"enqueued";
	}
}

Another:

	"dequeued" -> "dequeued" [ label = "dequeue_task" ];

edge would be needed in that case for >= v6.12. It is probably nothing
and can be easily handled by the users if they run into it, but just
putting it out there for the record in case you only want to consider a
complete dequeue as "dequeued". Feel free to ignore since I'm completely
out of my depth when it comes to the usage of RV in the field :)

> +
> 	if (sched_core_enabled(rq))
> 		sched_core_dequeue(rq, p, flags);

-- 
Thanks and Regards,
Prateek
On Fri, Aug 01, 2025 at 09:12:08AM +0530, K Prateek Nayak wrote:
> Just thinking out loud, putting this tracepoint here can lead to a
> "dequeued -> dequeued" transition for fair tasks when they are in the
> delayed dequeue state.
>
> [...]
>
> edge would be needed in that case for >= v6.12. It is probably nothing
> and can be easily handled by the users if they run into it, but just
> putting it out there for the record in case you only want to consider a
> complete dequeue as "dequeued". Feel free to ignore since I'm completely
> out of my depth when it comes to the usage of RV in the field :)

Ah, thanks for pointing this out. I do want to only consider a complete
dequeue as "dequeued".

These tracepoints are not visible from userspace, and RV does not care
about enqueue/dequeue of fair tasks at the moment, so it is not a problem
for now. But as a precaution, I trust the below patch will do.

Nam

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index c38f12f7f903..b50668052f99 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -906,6 +906,14 @@ DECLARE_TRACE(dequeue_task_rt,
 	TP_PROTO(int cpu, struct task_struct *task),
 	TP_ARGS(cpu, task));
 
+DECLARE_TRACE(enqueue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
+DECLARE_TRACE(dequeue_task,
+	TP_PROTO(int cpu, struct task_struct *task),
+	TP_ARGS(cpu, task));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b485e0639616..553c08a63395 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2077,6 +2077,8 @@ unsigned long get_wchan(struct task_struct *p)
 
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_enqueue_task_tp(rq->cpu, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	 * and mark the task ->sched_delayed.
 	 */
 	uclamp_rq_dec(rq, p);
-	return p->sched_class->dequeue_task(rq, p, flags);
+	if (p->sched_class->dequeue_task(rq, p, flags)) {
+		trace_dequeue_task_tp(rq->cpu, p);
+		return true;
+	}
+	return false;
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
Hello Nam,

On 8/1/2025 12:59 PM, Nam Cao wrote:
> Ah, thanks for pointing this out. I do want to only consider a complete
> dequeue as "dequeued".
>
> These tracepoints are not visible from userspace, and RV does not care
> about enqueue/dequeue of fair tasks at the moment, so it is not a problem
> for now. But as a precaution, I trust the below patch will do.

There are a few more cases with delayed dequeue:

1. A delayed task can be reenqueued before it is completely dequeued and
   will lead to an enqueue -> enqueue transition if we don't trace the
   first dequeue.

2. There are cases in set_user_nice() and __sched_setscheduler() where
   a delayed task is dequeued in delayed state and put back on the
   cfs_rq in delayed state - an attempt to handle 1. can trip this.

Best I could think of is the below diff on top of your suggestion, where
a "delayed -> reenqueue" is skipped but case 2. is captured, in case one
needs to inspect some properties from set_user_nice() /
__sched_setscheduler():

(only build tested on top of the diff you had pasted)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9598984bee8d..1fc5a97bba6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2071,7 +2071,8 @@ unsigned long get_wchan(struct task_struct *p)
 
 void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	trace_enqueue_task_tp(rq->cpu, p);
+	if (!p->se.sched_delayed || !move_entity(flags))
+		trace_enqueue_task_tp(rq->cpu, p);
 
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..1e2a636d6804 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5444,7 +5444,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * put back on, and if we advance min_vruntime, we'll be placed back
 	 * further than we started -- i.e. we'll be penalized.
 	 */
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
+	if (move_entity(flags))
 		update_min_vruntime(cfs_rq);
 
 	if (flags & DEQUEUE_DELAYED)
@@ -7054,6 +7054,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 	/* Fix-up what dequeue_task_fair() skipped */
 	hrtick_update(rq);
+	trace_dequeue_task_tp(rq->cpu, p);
 
 	/*
 	 * Fix-up what block_task() skipped.
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..33897d35744a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1196,19 +1196,6 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	dec_rt_group(rt_se, rt_rq);
 }
 
-/*
- * Change rt_se->run_list location unless SAVE && !MOVE
- *
- * assumes ENQUEUE/DEQUEUE flags match
- */
-static inline bool move_entity(unsigned int flags)
-{
-	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
-		return false;
-
-	return true;
-}
-
 static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
 {
 	list_del_init(&rt_se->run_list);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3f33d10c58c..37730cd834ba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2361,6 +2361,20 @@ extern const u32 sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+/*
+ * Checks for a SAVE/RESTORE without MOVE. Returns false if
+ * SAVE and !MOVE.
+ *
+ * Assumes ENQUEUE/DEQUEUE flags match.
+ */
+static inline bool move_entity(unsigned int flags)
+{
+	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
+		return false;
+
+	return true;
+}
+
 struct affinity_context {
 	const struct cpumask *new_mask;
 	struct cpumask *user_mask;
---

Thoughts?

-- 
Thanks and Regards,
Prateek

P.S. move_entity() probably requires a better name and perhaps can even
be simplified. I wrote out the below table just to convince myself to
reuse move_entity() in fair.c:

	flags contains   (SAVE | MOVE) == SAVE   (SAVE | MOVE) != SAVE
	neither          false                   true
	SAVE             true                    false
	MOVE             false                   true
	SAVE | MOVE      false                   true
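
The table is easy to check mechanically outside the kernel. A stand-alone
sketch, with the flag values copied from kernel/sched/sched.h (0x02/0x04
at the time of writing):

	#include <stdbool.h>
	#include <stdio.h>

	#define DEQUEUE_SAVE	0x02	/* matches ENQUEUE_RESTORE */
	#define DEQUEUE_MOVE	0x04	/* matches ENQUEUE_MOVE */

	/* Same predicate as move_entity(): false only for SAVE without MOVE. */
	static bool move_entity(unsigned int flags)
	{
		return (flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE;
	}

	int main(void)
	{
		printf("neither:   %d\n", move_entity(0));				/* 1 */
		printf("SAVE:      %d\n", move_entity(DEQUEUE_SAVE));			/* 0 */
		printf("MOVE:      %d\n", move_entity(DEQUEUE_MOVE));			/* 1 */
		printf("SAVE|MOVE: %d\n", move_entity(DEQUEUE_SAVE | DEQUEUE_MOVE));	/* 1 */
		return 0;
	}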
On Fri, 2025-08-01 at 15:26 +0530, K Prateek Nayak wrote:
> There are a few more cases with delayed dequeue:
>
> 1. A delayed task can be reenqueued before it is completely dequeued
>    and will lead to an enqueue -> enqueue transition if we don't trace
>    the first dequeue.
>
> 2. There are cases in set_user_nice() and __sched_setscheduler() where
>    a delayed task is dequeued in delayed state and put back on the
>    cfs_rq in delayed state - an attempt to handle 1. can trip this.
>
> Best I could think of is the below diff on top of your suggestion, where
> a "delayed -> reenqueue" is skipped but case 2. is captured, in case one
> needs to inspect some properties from set_user_nice() /
> __sched_setscheduler():
>
> (only build tested on top of the diff you had pasted)

Hello Prateek,

thanks for the comments, this looks much more convoluted than I would
have expected.
As Nam pointed out, currently RV is not going to rely on those events
for fair tasks (the existing monitors are fine with tracepoints like
wakeup/set_state/switch).

For the time being it might be better to just add the tracepoints in the
RT enqueue/dequeue (the only ones needed for this series) and not
complicate things.

At most we may want to keep the tracepoint names general, allowing the
tracing call to be added later to other locations (or moved to a general
one) without changing too much besides existing users.
And perhaps a comment saying the tracepoints are currently only
supported on RT would do.

Anyway, that's your call Nam, I'm fine with your initial proposal as
well, unless some scheduler guys complain.

Thanks,
Gabriele
Hello Gabriele,

On 8/1/2025 4:34 PM, Gabriele Monaco wrote:
> thanks for the comments, this looks much more convoluted than I would
> have expected.
> As Nam pointed out, currently RV is not going to rely on those events
> for fair tasks (the existing monitors are fine with tracepoints like
> wakeup/set_state/switch).
>
> For the time being it might be better to just add the tracepoints in the
> RT enqueue/dequeue (the only ones needed for this series) and not
> complicate things.
>
> At most we may want to keep the tracepoint names general, allowing the
> tracing call to be added later to other locations (or moved to a general
> one) without changing too much besides existing users.
> And perhaps a comment saying the tracepoints are currently only
> supported on RT would do.

Most of my comments were just thinking out loud around fair tasks being
delayed on the dequeue path. If RV filters out RT tasks and the use case
only concerns them, then Nam's suggestion is good.

I was just being cautious of folks expecting an "enqueued <--> dequeued"
transition for *all* tasks and finding it doesn't hold after delayed
dequeue. Since these are internal tracepoints, I'm sure folks using them
with RV would do their due diligence when testing these monitors before
deployment.

> Anyway, that's your call Nam, I'm fine with your initial proposal as
> well, unless some scheduler guys complain.

I would be happy to help test alternate approaches if others have
concerns around delayed dequeue, but for all other cases Nam's approach
looks good. Sorry if my paranoia caused you folks any trouble!

-- 
Thanks and Regards,
Prateek
K Prateek Nayak <kprateek.nayak@amd.com> writes:
> Most of my comments were just thinking out loud around fair tasks being
> delayed on the dequeue path. If RV filters out RT tasks and the use case
> only concerns them, then Nam's suggestion is good.
>
> [...]
>
> I would be happy to help test alternate approaches if others have
> concerns around delayed dequeue, but for all other cases Nam's approach
> looks good. Sorry if my paranoia caused you folks any trouble!

No trouble at all, it was all helpful comments. I agree with Gabriele, it
is not important right now, so I will stick to the latest diff I sent.
Leaving it to the poor soul who needs this for fair tasks to figure it
out (which will probably be future me).

Thanks for the insights,
Nam
On Thu, 2025-07-31 at 09:35 +0200, Nam Cao wrote:
> On Wed, Jul 30, 2025 at 06:18:45PM +0200, Gabriele Monaco wrote:
> > Well, thinking about it again, these tracepoints might simplify things
> > considerably when tasks change policy.
> >
> > [...]
> >
> > If you go down the route of adding tracepoints, why not have other
> > classes benefit too? I believe calling them from enqueue_task() /
> > dequeue_task() in sched/core.c would allow you to easily filter by
> > policy anyway (haven't tested).
>
> Something like the untested patch below?
>
> Will you have a use case for it too? Then I will try to accommodate
> your use case, otherwise I will do just enough for my case.

Well, I'm still defining the best set of tracepoints I need; if you see
it cleaner, go ahead the way you're currently doing, then. Unless anyone
else complains, let's keep it like this.

Thanks,
Gabriele