[RFC] Repeated rto_push_irq_work_func() invocation.

Posted by Sebastian Andrzej Siewior 1 month, 3 weeks ago
I have this in my RT queue:

--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2193,8 +2193,11 @@ static int rto_next_cpu(struct root_doma
 
 		rd->rto_cpu = cpu;
 
-		if (cpu < nr_cpu_ids)
+		if (cpu < nr_cpu_ids) {
+			if (!has_pushable_tasks(cpu_rq(cpu)))
+				continue;
 			return cpu;
+		}
 
 		rd->rto_cpu = -1;
 
This avoided a large number of IPIs to queue and invoke rto_push_work
while an RT task was scheduled. This improved with commit
   612f769edd06a ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")

Now, looking at this again, I still see invocations that are skipped due
to this patch, more often on an idle CPU than on a busy one. Given that
the task is removed from the list and the mask is cleared almost
immediately, this looks like a small window which is probably negligible.

One thing I am not sure what to do about (from a busy trace):
|      ksoftirqd/5-63      [005] dN.31  4446.750055: sched_waking: comm=rcu_preempt pid=17 prio=98 target_cpu=005
|      ksoftirqd/5-63      [005] dN.41  4446.750058: enqueue_pushable_task: Add rcu_preempt-17
|      ksoftirqd/5-63      [005] dN.41  4446.750059: enqueue_pushable_task: Set 5

Since the enqueued task is not yet running on the CPU, it gets added to
the pushable list (the task_current() check could be removed since an
enqueued task can never be on the CPU, right?). Given the priorities, the
new task will preempt the current task.

|      ksoftirqd/5-63      [005] dN.41  4446.750060: sched_wakeup: comm=rcu_preempt pid=17 prio=98 target_cpu=005
|      ksoftirqd/5-63      [005] dN.31  4446.750062: sched_stat_runtime: comm=ksoftirqd/5 pid=63 runtime=14625 [ns]
|       cyclictest-5192    [003] d..2.  4446.750062: sched_stat_runtime: comm=cyclictest pid=5192 runtime=13066 [ns]
|       cyclictest-5192    [003] d..2.  4446.750064: dequeue_pushable_task: Del cyclictest-5192
|       cyclictest-5192    [003] d..3.  4446.750065: rto_next_cpu.constprop.0: Look count 1
|       cyclictest-5192    [003] d..3.  4446.750066: rto_next_cpu.constprop.0: Leave CPU 5

This is then observed by other CPUs in the system so rto_next_cpu()
returns CPU 5, resulting in a schedule of rto_push_work to CPU5.

|      ksoftirqd/5-63      [005] dNh1.  4446.750069: push_rt_task: Start
|      ksoftirqd/5-63      [005] dNh1.  4446.750070: push_rt_task: Push rcu_preempt-17 98
|      ksoftirqd/5-63      [005] dNh1.  4446.750071: push_rt_task: resched

push_rt_task() didn't do anything because NEED_RESCHED is already set.

|      ksoftirqd/5-63      [005] dNh1.  4446.750071: rto_next_cpu.constprop.0: Look count 1
|      ksoftirqd/5-63      [005] dNh1.  4446.750072: rto_next_cpu.constprop.0: Leave CPU 5

but scheduled rto_push_work again.

|      ksoftirqd/5-63      [005] dNh2.  4446.750074: push_rt_task: Start
|      ksoftirqd/5-63      [005] dNh2.  4446.750074: push_rt_task: Push rcu_preempt-17 98
|      ksoftirqd/5-63      [005] dNh2.  4446.750075: push_rt_task: resched

It came to the same conclusion.

|      ksoftirqd/5-63      [005] dNh2.  4446.750075: rto_next_cpu.constprop.0: Look count 1
|      ksoftirqd/5-63      [005] dNh2.  4446.750076: rto_next_cpu.constprop.0: Leave no CPU count: 1

It left with no CPU because it wrapped around; nothing was scheduled.

|      ksoftirqd/5-63      [005] dNh3.  4446.750077: sched_waking: comm=irq_work/5 pid=60 prio=98 target_cpu=005
|       cyclictest-5216    [027] d..2.  4446.750077: dequeue_pushable_task: Del cyclictest-5216
|       cyclictest-5216    [027] d..3.  4446.750079: rto_next_cpu.constprop.0: Look count 1
|      ksoftirqd/5-63      [005] dNh4.  4446.750079: sched_wakeup: comm=irq_work/5 pid=60 prio=98 target_cpu=005
|       cyclictest-5216    [027] d..3.  4446.750080: rto_next_cpu.constprop.0: Leave CPU 5

CPU 5 is making progress in terms of scheduling, but then CPU 27 noticed
the mask and scheduled another rto_push_work.

|      ksoftirqd/5-63      [005] dN.2.  4446.750084: dequeue_pushable_task: Del rcu_preempt-17
|      ksoftirqd/5-63      [005] dN.2.  4446.750085: dequeue_pushable_task: Clear 5
|      ksoftirqd/5-63      [005] d..2.  4446.750086: sched_switch: prev_comm=ksoftirqd/5 prev_pid=63 prev_prio=120 prev_state=R+ ==> next_comm=rcu_preempt next_pid=17 next_prio=98
|      rcu_preempt-17      [005] d.h21  4446.750089: rto_next_cpu.constprop.0: Look count 0
|      rcu_preempt-17      [005] d.h21  4446.750089: rto_next_cpu.constprop.0: Leave no CPU count: 0

This rto_next_cpu() was triggered earlier by CPU27.

At this point I'm not sure if there is something that could be done
about it or if it is a special case.

Would it make sense to avoid scheduling rto_push_work if rq->curr has
NEED_RESCHED set and make the scheduler do push_rt_task()?
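A minimal user-space sketch of that idea (stand-in structs, not the real task_struct/rq, and rto_skip_cpu() is a made-up name):

```c
#include <stdbool.h>

/* Toy model of the proposed filter: don't queue rto_push_work to a CPU
 * whose current task already has NEED_RESCHED set, since that CPU is
 * about to enter schedule() anyway. These structs are simplified
 * stand-ins, and need_resched stands in for TIF_NEED_RESCHED.
 */
struct task {
	bool need_resched;
};

struct rq {
	struct task *curr;
};

/* In the kernel this read would need READ_ONCE(rq->curr) plus RCU or
 * disabled preemption so the task cannot be freed under us.
 */
static bool rto_skip_cpu(struct rq *rq)
{
	return rq->curr->need_resched;
}
```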

Sebastian
Re: [RFC] Repeated rto_push_irq_work_func() invocation.
Posted by Steven Rostedt 1 month, 3 weeks ago
On Wed, 2 Oct 2024 13:21:05 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> Would it make sense to avoid scheduling rto_push_work if rq->curr has
> NEED_RESCHED set and make the scheduler do push_rt_task()?

Can we safely check rq->curr without taking any locks on that CPU?

I guess tasks do not get freed while a CPU has preemption disabled, so it
may be safe to do:

	task = READ_ONCE(rq->curr);
	if (test_tsk_need_resched(task))
		/* skip */

??

-- Steve
Re: [RFC] Repeated rto_push_irq_work_func() invocation.
Posted by Sebastian Andrzej Siewior 1 month, 3 weeks ago
On 2024-10-02 10:37:49 [-0400], Steven Rostedt wrote:
> On Wed, 2 Oct 2024 13:21:05 +0200
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > Would it make sense to avoid scheduling rto_push_work if rq->curr has
> > NEED_RESCHED set and make the scheduler do push_rt_task()?
> 
> Can we safely check rq->curr without taking any locks on that CPU?
> 
> I guess tasks do not get freed while a CPU has preemption disabled, so it
> may be safe to do:
> 
> 	task = READ_ONCE(rq->curr);
> 	if (test_tsk_need_resched(task))
> 		/* skip */
> 
> ??

+rcu_read_lock() but yes. This would be one part. You need to check if
the task on pull-list has lower priority than the current task on rq.
This would need to be moved to somewhere in schedule(), probably after
pick_next_task().

> -- Steve

Sebastian
Re: [RFC] Repeated rto_push_irq_work_func() invocation.
Posted by Steven Rostedt 1 month, 3 weeks ago
On Wed, 2 Oct 2024 17:05:43 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2024-10-02 10:37:49 [-0400], Steven Rostedt wrote:
> > On Wed, 2 Oct 2024 13:21:05 +0200
> > Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> >   
> > > Would it make sense to avoid scheduling rto_push_work if rq->curr has
> > > NEED_RESCHED set and make the scheduler do push_rt_task()?  
> > 
> > Can we safely check rq->curr without taking any locks on that CPU?
> > 
> > I guess tasks do not get freed while a CPU has preemption disabled, so it
> > may be safe to do:
> > 
> > 	task = READ_ONCE(rq->curr);
> > 	if (test_tsk_need_resched(task))
> > 		/* skip */
> > 
> > ??  
> 
> +rcu_read_lock() but yes. This would be one part. You need to check if

Yeah, preempt_disable() is pretty much equivalent today to rcu_read_lock(),
as synchronize_rcu() and synchronize_sched() have been merged into one.

> the task on pull-list has lower priority than the current task on rq.
> This would need to be moved to somewhere in schedule(), probably after
> pick_next_task().

Does it matter? The target CPU has NEED_RESCHED set, which should handle
the pushing logic anyway, right?

Hmm, I probably should look deeper to make sure that anytime a schedule
happens where there's overloaded RT tasks, it tries to push.

-- Steve
Re: [RFC] Repeated rto_push_irq_work_func() invocation.
Posted by Sebastian Andrzej Siewior 1 month, 3 weeks ago
On 2024-10-02 11:19:20 [-0400], Steven Rostedt wrote:
> > +rcu_read_lock() but yes. This would be one part. You need to check if
> 
> Yeah, preempt_disable() is pretty much equivalent today to rcu_read_lock(),
> as synchronize_rcu() and synchronize_sched() have been merged into one.

I know but…

> > the task on pull-list has lower priority than the current task on rq.
> > This would need to be moved to somewhere in schedule(), probably after
> > pick_next_task().
> 
> Does it matter? The target CPU has NEED_RESCHED set, which should handle
> the pushing logic anyway, right?

I am not sure. The task can only be added to the rq and the push list
with the rq lock held, so schedule() should see it by the time it picks
the task with the highest priority. But part of what
rto_push_irq_work_func()/push_rt_task() needs to do is push tasks with
lower priority to another CPU, and that doesn't happen here as far as I
understand the code.
This looks expensive to perform in schedule(), but then it is also done
via the IPI. On the plus side, after the RT task has been made curr it
is removed from the list, so this could already be used as a hint for
whether this needs to be done or not.

> Hmm, I probably should look deeper to make sure that anytime a schedule
> happens where there's overloaded RT tasks, it tries to push.

I think it tried to pull on schedule, but then it got outsourced to push
via the IPI with this new code. I don't remember the details.

> -- Steve

Sebastian
Re: [RFC] Repeated rto_push_irq_work_func() invocation.
Posted by Steven Rostedt 1 month, 3 weeks ago
On Wed, 2 Oct 2024 17:35:00 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> > Hmm, I probably should look deeper to make sure that anytime a schedule
> > happens where there's overloaded RT tasks, it tries to push.  
> 
> I think it tried pull on schedule but then it got outsourced to push via
> the IPI with this new code. Don't remember the details 

Pull is much more expensive than push. That's because we keep track of
the "cpuprio", which is the priority of the task running on each CPU. To
do a push, you simply look for the CPU running the lowest-priority task
using the cpuprio logic and send the task there.

For a pull, we have no idea which is the highest-priority RT task that is
waiting. We originally did a search of all the CPUs with overloaded RT
tasks, but this search takes way too long, and since it's done in the
scheduler, it caused high latency, which is what the IPI dance is trying
to mitigate.
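
The asymmetry can be sketched with a toy model. The real cpupri code
(kernel/sched/cpupri.c) keeps per-priority CPU masks so the lookup is
close to O(1); a plain linear scan is shown here just to make the idea
concrete:

```c
/* cpu_prio[i] models the priority of the task currently running on
 * CPU i (kernel convention: numerically higher value = lower priority,
 * e.g. 120 for a normal task vs. 98 for an RT task). Finding a push
 * target is then a single scan of this summary; pull has no such
 * per-CPU summary of waiting tasks and must search the overloaded rqs.
 */
static int find_lowest_prio_cpu(const int *cpu_prio, int nr_cpus)
{
	int cpu, lowest = 0;

	for (cpu = 1; cpu < nr_cpus; cpu++) {
		if (cpu_prio[cpu] > cpu_prio[lowest])
			lowest = cpu;
	}
	return lowest;
}
```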

-- Steve