[RFC PATCH sched/urgent] sched: Task still delay-dequeued after switched from fair

Tejun Heo posted 1 patch 3 weeks, 5 days ago
[RFC PATCH sched/urgent] sched: Task still delay-dequeued after switched from fair
Posted by Tejun Heo 3 weeks, 5 days ago
On the current tip/sched/urgent, the following can be easily triggered by
running `tools/testing/selftests/sched_ext/runner -t reload_loop`:

  p->se.sched_delayed
  WARNING: CPU: 0 PID: 1686 at kernel/sched/fair.c:13191 switched_to_fair+0x7a/0x80
  ...
  Sched_ext: maximal (disabling)
  RIP: 0010:switched_to_fair+0x7a/0x80
  Code: a6 fe ff 5b 41 5e c3 cc cc cc cc cc 4c 89 f7 5b 41 5e e9 49 7f fe ff c6 05 53 c0 80 02 01 48 c7 c7 27 4a e6 82 e8 c6 8f fa ff <0f> 0b eb a2 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
  RSP: 0018:ffffc90001253d40 EFLAGS: 00010086
  RAX: 0000000000000013 RBX: ffff888103a6d380 RCX: 0000000000000027
  RDX: 0000000000000002 RSI: 00000000ffffdfff RDI: ffff888237c1b448
  RBP: 0000000000030380 R08: 0000000000001fff R09: ffffffff8368e000
  R10: 0000000000005ffd R11: 0000000000000004 R12: ffffc90001253d58
  R13: ffffffff82eda0c0 R14: ffff888237db0380 R15: ffff888103a6d380
  FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa289417000 CR3: 0000000003e58000 CR4: 0000000000750eb0
  PKRU: 55555554
  Call Trace:
   <TASK>
   scx_ops_disable_workfn+0x71b/0x930
   kthread_worker_fn+0x105/0x2a0
   kthread+0xe8/0x110
   ret_from_fork+0x33/0x40
   ret_from_fork_asm+0x1a/0x30
   </TASK>

The problem is that when a task is switched from fair to ext, it can remain
delay-dequeued, which triggers the above warning when the task goes back to
fair. I can work around it with the following patch, but that doesn't seem
like the right way to handle it. Shouldn't e.g. fair->switched_from() cancel
the delayed dequeue?
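
As a rough illustration of the sequence described above, here is a small
userspace model, not kernel code. Every toy_* name is hypothetical, and the
model assumes only what the report states: a task can carry sched_delayed out
of fair, and switched_to_fair() warns if the flag is still set on the way back.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_class { TOY_FAIR, TOY_EXT };

struct toy_task {
	enum toy_class cls;
	bool sched_delayed;	/* stands in for p->se.sched_delayed */
};

/* Class switch that leaves the delayed-dequeue state untouched. */
static void toy_switch_class(struct toy_task *p, enum toy_class to)
{
	p->cls = to;
}

/* Class switch that cancels a pending delayed dequeue when the class changes. */
static void toy_switch_class_fixed(struct toy_task *p, enum toy_class to)
{
	if (p->cls != to && p->sched_delayed)
		p->sched_delayed = false;
	p->cls = to;
}

/* Models the check in switched_to_fair(): true means the warning would fire. */
static bool toy_switched_to_fair_warns(const struct toy_task *p)
{
	return p->cls == TOY_FAIR && p->sched_delayed;
}

/* One load/unload cycle: fair -> ext -> fair, starting delay-dequeued. */
static bool toy_reload_warns(bool fixed)
{
	struct toy_task p = { .cls = TOY_FAIR, .sched_delayed = true };
	void (*sw)(struct toy_task *, enum toy_class) =
		fixed ? toy_switch_class_fixed : toy_switch_class;

	assert(p.sched_delayed);	/* precondition of the scenario */
	sw(&p, TOY_EXT);		/* scheduler enabled */
	sw(&p, TOY_FAIR);		/* scheduler disabled again */
	return toy_switched_to_fair_warns(&p);
}
```

Calling toy_reload_warns(false) walks the unpatched round trip and
toy_reload_warns(true) the one where the flag is cleared on the way out of
fair.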

Thanks.

---
 kernel/sched/ext.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 65334c13ffa5..601aad1a2625 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5205,8 +5205,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
 		struct sched_enq_and_set_ctx ctx;
+		int deq_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
 
-		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+		if (p->se.sched_delayed)
+			deq_flags |= DEQUEUE_SLEEP | DEQUEUE_DELAYED;
+
+		sched_deq_and_put_task(p, deq_flags, &ctx);
 
 		p->scx.slice = SCX_SLICE_DFL;
 		p->sched_class = __setscheduler_class(p->policy, p->prio);
Re: [RFC PATCH sched/urgent] sched: Task still delay-dequeued after switched from fair
Posted by Peter Zijlstra 3 weeks, 4 days ago
On Tue, Oct 29, 2024 at 02:07:11PM -1000, Tejun Heo wrote:
> On the current tip/sched/urgent, the following can be easily triggered by
> running `tools/testing/selftests/sched_ext/runner -t reload_loop`:

> The problem is that when a task is switched from fair to ext, it can
> remain delay-dequeued, which triggers the above warning when the task
> goes back to fair. I can work around it with the following patch, but
> that doesn't seem like the right way to handle it. Shouldn't e.g.
> fair->switched_from() cancel the delayed dequeue?

->switched_from() used to do this, but it is too late. I have a TODO
item fairly high on the todo list to rework the whole
switch{ing,ed}_{from,to} hookery to make all this more sane.

But yeah, it seems I missed the below case where we are switching class.

> ---
>  kernel/sched/ext.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 65334c13ffa5..601aad1a2625 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5205,8 +5205,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	while ((p = scx_task_iter_next_locked(&sti))) {
>  		const struct sched_class *old_class = p->sched_class;
>  		struct sched_enq_and_set_ctx ctx;
> +		int deq_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
>  
> -		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
> +		if (p->se.sched_delayed)
> +			deq_flags |= DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> +
> +		sched_deq_and_put_task(p, deq_flags, &ctx);

I don't think this is quite right, the problem is that in this case
ctx.queued is reporting true, even though you want it false.

This is why 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
adds a second dequeue.
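
To make the ctx.queued point concrete, here is a toy userspace model, not
kernel code. All toy_* names and flag values are hypothetical. It assumes the
save/restore helper snapshots "queued" before the dequeue flags take effect,
and that a delay-dequeued task still counts as queued at that point.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag bits; the real DEQUEUE_* values are defined by the kernel. */
enum {
	TOY_DEQUEUE_SLEEP   = 1 << 0,
	TOY_DEQUEUE_SAVE    = 1 << 1,
	TOY_DEQUEUE_MOVE    = 1 << 2,
	TOY_DEQUEUE_DELAYED = 1 << 3,
};

struct toy_task {
	bool on_rq;		/* task counts as queued on its runqueue */
	bool sched_delayed;	/* stands in for p->se.sched_delayed */
};

struct toy_ctx {
	struct toy_task *p;
	bool queued;		/* snapshot taken before the dequeue */
};

/* A SLEEP|DELAYED dequeue completes a pending delayed dequeue. */
static void toy_dequeue_task(struct toy_task *p, int flags)
{
	int finish = TOY_DEQUEUE_SLEEP | TOY_DEQUEUE_DELAYED;

	if (p->sched_delayed && (flags & finish) == finish) {
		p->sched_delayed = false;
		p->on_rq = false;
	}
	/* A plain SAVE|MOVE dequeue is temporary; on_rq is left alone here. */
}

static void toy_deq_and_put(struct toy_task *p, int flags, struct toy_ctx *ctx)
{
	ctx->p = p;
	ctx->queued = p->on_rq;	/* recorded before any flag takes effect */
	if (ctx->queued)
		toy_dequeue_task(p, flags);
}

static void toy_enq_and_set(struct toy_ctx *ctx)
{
	assert(ctx->p);
	if (ctx->queued)
		ctx->p->on_rq = true;	/* re-enqueue on the new class */
}

/* Extra flags folded into the save/restore dequeue. */
static struct toy_task toy_switch_folded(void)
{
	struct toy_task p = { .on_rq = true, .sched_delayed = true };
	struct toy_ctx ctx;

	toy_deq_and_put(&p, TOY_DEQUEUE_SAVE | TOY_DEQUEUE_MOVE |
			    TOY_DEQUEUE_SLEEP | TOY_DEQUEUE_DELAYED, &ctx);
	toy_enq_and_set(&ctx);
	return p;
}

/* Separate SLEEP|DELAYED dequeue first, then the usual save/restore pair. */
static struct toy_task toy_switch_separate(void)
{
	struct toy_task p = { .on_rq = true, .sched_delayed = true };
	struct toy_ctx ctx;

	toy_dequeue_task(&p, TOY_DEQUEUE_SLEEP | TOY_DEQUEUE_DELAYED);
	toy_deq_and_put(&p, TOY_DEQUEUE_SAVE | TOY_DEQUEUE_MOVE, &ctx);
	toy_enq_and_set(&ctx);
	return p;
}
```

In this model the folded variant clears sched_delayed but then re-enqueues a
task that should be asleep, because the snapshot was taken too early. The
separate dequeue lets the snapshot see the task as already off the runqueue.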

Also, you seem to have a second instance of all that.

Does the below work for you? I suppose I might as well go work on that
TODO item now.

---
 kernel/sched/ext.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 40bdfe84e4f0..587e7d1a1e96 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4489,11 +4489,16 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	scx_task_iter_start(&sti);
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
+		const struct sched_class *new_class =
+			__setscheduler_class(p->policy, p->prio);
 		struct sched_enq_and_set_ctx ctx;
 
+		if (old_class != new_class && p->se.sched_delayed)
+			dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
 		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
 
-		p->sched_class = __setscheduler_class(p->policy, p->prio);
+		p->sched_class = new_class;
 		check_class_changing(task_rq(p), p, old_class);
 
 		sched_enq_and_set_task(&ctx);
@@ -5199,12 +5204,17 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	scx_task_iter_start(&sti);
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
+		const struct sched_class *new_class =
+			__setscheduler_class(p->policy, p->prio);
 		struct sched_enq_and_set_ctx ctx;
 
+		if (old_class != new_class && p->se.sched_delayed)
+			dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
 		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
 
 		p->scx.slice = SCX_SLICE_DFL;
-		p->sched_class = __setscheduler_class(p->policy, p->prio);
+		p->sched_class = new_class;
 		check_class_changing(task_rq(p), p, old_class);
 
 		sched_enq_and_set_task(&ctx);
Re: [RFC PATCH sched/urgent] sched: Task still delay-dequeued after switched from fair
Posted by Tejun Heo 3 weeks, 4 days ago
Hello,

On Wed, Oct 30, 2024 at 11:49:34AM +0100, Peter Zijlstra wrote:
...
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 65334c13ffa5..601aad1a2625 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -5205,8 +5205,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> >  	while ((p = scx_task_iter_next_locked(&sti))) {
> >  		const struct sched_class *old_class = p->sched_class;
> >  		struct sched_enq_and_set_ctx ctx;
> > +		int deq_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
> >  
> > -		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
> > +		if (p->se.sched_delayed)
> > +			deq_flags |= DEQUEUE_SLEEP | DEQUEUE_DELAYED;
> > +
> > +		sched_deq_and_put_task(p, deq_flags, &ctx);
> 
> I don't think this is quite right, the problem is that in this case
> ctx.queued is reporting true, even though you want it false.
> 
> This is why 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
> adds a second dequeue.

I see. Yeah, ctx.queued would be set incorrectly.

> Also, you seem to have a second instance of all that.

The disable path doesn't really need it because the transition direction is
always scx -> fair but yeah keeping the two loops in sync is fine too.

> Does the below work for you? I suppose I might as well go work on that
> TODO item now.

Yeap, it works. Will ack on the other thread.

Thanks.

-- 
tejun
[tip: sched/urgent] sched/ext: Fix scx vs sched_delayed
Posted by tip-bot2 for Peter Zijlstra 3 weeks, 3 days ago
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     69d5e722be949a1e2409c3f2865ba6020c279db6
Gitweb:        https://git.kernel.org/tip/69d5e722be949a1e2409c3f2865ba6020c279db6
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 30 Oct 2024 11:49:34 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 30 Oct 2024 22:42:12 +01:00

sched/ext: Fix scx vs sched_delayed

Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs
switched_from_fair()") forgot about scx :/

Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20241030104934.GK14555@noisy.programming.kicks-ass.net
---
 kernel/sched/ext.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 40bdfe8..721a754 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4489,11 +4489,16 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	scx_task_iter_start(&sti);
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
+		const struct sched_class *new_class =
+			__setscheduler_class(p->policy, p->prio);
 		struct sched_enq_and_set_ctx ctx;
 
+		if (old_class != new_class && p->se.sched_delayed)
+			dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
 		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
 
-		p->sched_class = __setscheduler_class(p->policy, p->prio);
+		p->sched_class = new_class;
 		check_class_changing(task_rq(p), p, old_class);
 
 		sched_enq_and_set_task(&ctx);
@@ -5199,12 +5204,17 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	scx_task_iter_start(&sti);
 	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
+		const struct sched_class *new_class =
+			__setscheduler_class(p->policy, p->prio);
 		struct sched_enq_and_set_ctx ctx;
 
+		if (old_class != new_class && p->se.sched_delayed)
+			dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
 		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
 
 		p->scx.slice = SCX_SLICE_DFL;
-		p->sched_class = __setscheduler_class(p->policy, p->prio);
+		p->sched_class = new_class;
 		check_class_changing(task_rq(p), p, old_class);
 
 		sched_enq_and_set_task(&ctx);