[PATCH v2] sched/rt: Skip currently executing CPU in rto_next_cpu()
Posted by Chen Jinghuang 2 months, 2 weeks ago
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution of
rto_push_irq_work_func(). During push_rt_task() on CPU0, if
next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED, and
after the push operation completes, CPU0 calls rto_next_cpu(). Since only
CPU0 is overloaded in this scenario, rto_next_cpu() should ideally return
-1 (no further IPI needed).
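
For reference, this is the check in push_rt_task() that the paragraph
above refers to (abridged from mainline kernel/sched/rt.c, comment
paraphrased; the exact surrounding code may differ between kernel
versions):

	/*
	 * The pushable task is more important (lower prio value) than
	 * what this rq is currently running for: don't push it away,
	 * just ask the local CPU to reschedule.
	 */
	if (unlikely(next_task->prio < rq->donor->prio)) {
		resched_curr(rq);
		return 0;
	}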

However, every CPU that invokes tell_cpu_to_push() during LB increments
rd->rto_loop_next. Even when rd->rto_cpu has been reset to -1, the
mismatch between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu()
to restart its search from -1. With CPU0 still overloaded (satisfying
rt_nr_migratory && rt_nr_total > 1), it gets reselected, so CPU0 queues
irq_work to itself and sends self-IPIs repeatedly. As long as CPU0 stays
overloaded and other CPUs keep running pull_rt_task(), it falls into an
infinite self-IPI loop, wasting CPU cycles on unnecessary interrupt
handling.
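
The counter interplay that forces the restart is this one (abridged from
mainline kernel/sched/rt.c, comments paraphrased; surrounding context
omitted):

	/* tell_cpu_to_push(): every caller bumps the generation counter. */
	atomic_inc(&rq->rd->rto_loop_next);

	/* rto_next_cpu(): once rto_mask has been exhausted ... */
	rd->rto_cpu = -1;

	/* ... stop only if no new push request arrived in the meantime. */
	next = atomic_read_acquire(&rd->rto_loop_next);
	if (rd->rto_loop == next)
		break;

	/* Otherwise record the new generation and restart the walk from -1. */
	rd->rto_loop = next;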

The triggering scenario is as follows:

         cpu0                        cpu1                        cpu2
                                 pull_rt_task
                               tell_cpu_to_push
                 <------------ irq_work_queue_on
rto_push_irq_work_func
    push_rt_task
    resched_curr(rq)                                         pull_rt_task
    rto_next_cpu                                           tell_cpu_to_push
                 <---------------------------------- atomic_inc(rto_loop_next)
rd->rto_loop != next
    rto_next_cpu
    irq_work_queue_on
rto_push_irq_work_func

Fix the redundant self-IPIs by skipping the currently executing CPU in
rto_next_cpu().

Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
v1-->v2:
	- Replace the original "check NEED_RESCHED on target CPU"
	  logic with "skip the currently executing CPU"
	- This modification eliminates self-IPIs
---
---
 kernel/sched/rt.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..dc0d583aa59a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2100,6 +2100,7 @@ static void push_rt_tasks(struct rq *rq)
  */
 static int rto_next_cpu(struct root_domain *rd)
 {
+	int this_cpu = smp_processor_id();
 	int next;
 	int cpu;
 
@@ -2118,14 +2119,17 @@ static int rto_next_cpu(struct root_domain *rd)
 	 */
 	for (;;) {
 
-		/* When rto_cpu is -1 this acts like cpumask_first() */
-		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
-
-		rd->rto_cpu = cpu;
+		do {
+			/* When rto_cpu is -1 this acts like cpumask_first() */
+			cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+			rd->rto_cpu = cpu;
+			/* Do not send IPI to self */
+		} while (cpu == this_cpu);
 
 		if (cpu < nr_cpu_ids)
 			return cpu;
 
+
 		rd->rto_cpu = -1;
 
 		/*
-- 
2.34.1
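
To see what the new do/while changes in behaviour, here is a small
stand-alone illustration (plain user-space C, not kernel code; NR_CPUS,
next_set_cpu() and pick_next() are made-up stand-ins for nr_cpu_ids,
cpumask_next() and the mask walk in rto_next_cpu(), and the rto_loop
generation check is not modelled):

#include <stdio.h>

#define NR_CPUS 8

/* Return the next set bit after 'prev', or NR_CPUS if none is left. */
static int next_set_cpu(int prev, unsigned int mask)
{
	for (int cpu = prev + 1; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* The new mask walk: never hand back the CPU doing the searching. */
static int pick_next(int this_cpu, unsigned int rto_mask, int rto_cpu)
{
	int cpu;

	do {
		cpu = next_set_cpu(rto_cpu, rto_mask);
		rto_cpu = cpu;
		/* Do not send IPI to self */
	} while (cpu == this_cpu);

	return cpu < NR_CPUS ? cpu : -1;
}

int main(void)
{
	/* Only CPU0 is overloaded and CPU0 itself searches: no self-IPI. */
	printf("%d\n", pick_next(0, 1u << 0, -1));               /* -1 */
	/* CPU0 and CPU3 are overloaded, CPU0 searches: CPU3 is chosen. */
	printf("%d\n", pick_next(0, (1u << 0) | (1u << 3), -1)); /*  3 */
	return 0;
}

With the previous code the first case would have returned 0, i.e. CPU0
would have queued irq_work to itself and sent a self-IPI.
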
Re: [PATCH v2] sched/rt: Skip currently executing CPU in rto_next_cpu()
Posted by Steven Rostedt 2 months, 1 week ago
On Tue, 25 Nov 2025 08:36:49 +0000
Chen Jinghuang <chenjinghuang2@huawei.com> wrote:

> CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
> RT task, and a CFS task stuck in kernel space. When other CPUs switch from
> RT to non-RT tasks, RT load balancing (LB) is triggered; with
> HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution of
> rto_push_irq_work_func(). During push_rt_task() on CPU0, if
> next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED, and
> after the push operation completes, CPU0 calls rto_next_cpu(). Since only
> CPU0 is overloaded in this scenario, rto_next_cpu() should ideally return
> -1 (no further IPI needed).
> 
> However, every CPU that invokes tell_cpu_to_push() during LB increments
> rd->rto_loop_next. Even when rd->rto_cpu has been reset to -1, the
> mismatch between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu()
> to restart its search from -1. With CPU0 still overloaded (satisfying
> rt_nr_migratory && rt_nr_total > 1), it gets reselected, so CPU0 queues
> irq_work to itself and sends self-IPIs repeatedly. As long as CPU0 stays
> overloaded and other CPUs keep running pull_rt_task(), it falls into an
> infinite self-IPI loop, wasting CPU cycles on unnecessary interrupt
> handling.
> 
> The triggering scenario is as follows:
> 
>          cpu0                        cpu1                        cpu2
>                                  pull_rt_task
>                                tell_cpu_to_push
>                  <------------ irq_work_queue_on
> rto_push_irq_work_func
>     push_rt_task
>     resched_curr(rq)                                         pull_rt_task
>     rto_next_cpu                                           tell_cpu_to_push
>                  <---------------------------------- atomic_inc(rto_loop_next)
> rd->rto_loop != next
>     rto_next_cpu
>     irq_work_queue_on
> rto_push_irq_work_func
> 
> Fix the redundant self-IPIs by skipping the currently executing CPU in
> rto_next_cpu().
> 
> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
> ---
> v1-->v2:
> 	- Replace the original "check NEED_RESCHED on target CPU"
> 	  logic with "skip the currently executing CPU"
> 	- This modification eliminates self-IPIs
> ---
> ---
>  kernel/sched/rt.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7936d4333731..dc0d583aa59a 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2100,6 +2100,7 @@ static void push_rt_tasks(struct rq *rq)
>   */
>  static int rto_next_cpu(struct root_domain *rd)
>  {
> +	int this_cpu = smp_processor_id();
>  	int next;
>  	int cpu;
>  
> @@ -2118,14 +2119,17 @@ static int rto_next_cpu(struct root_domain *rd)
>  	 */
>  	for (;;) {
>  
> -		/* When rto_cpu is -1 this acts like cpumask_first() */
> -		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
> -
> -		rd->rto_cpu = cpu;
> +		do {
> +			/* When rto_cpu is -1 this acts like cpumask_first() */
> +			cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
> +			rd->rto_cpu = cpu;
> +			/* Do not send IPI to self */
> +		} while (cpu == this_cpu);

So this fixes the issue you see too! Great!

>  
>  		if (cpu < nr_cpu_ids)
>  			return cpu;
>  
> +

Unneeded extra whitespace.

Other than that:

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

>  		rd->rto_cpu = -1;
>  
>  		/*
[PATCH v3] sched/rt: Skip currently executing CPU in rto_next_cpu() - Request for merge
Posted by Chen Jinghuang 2 months ago
Hi Steven Rostedt,

This is a follow-up on my v3 patch for sched/rt: "Skip currently executing
CPU in rto_next_cpu()" (archive link:
https://lore.kernel.org/lkml/20251126055403.2076735-1-chenjinghuang2@huawei.com/)

You previously confirmed that this patch resolves the issue of a CPU
sending an IPI to itself. This patch has also fully resolved the issue
I encountered in my testing environment. Could you please help merge this
v3 patch into mainline? I'm happy to address any further review comments.

Thanks a lot for your time and review!
Re: [PATCH v3] sched/rt: Skip currently executing CPU in rto_next_cpu() - Request for merge
Posted by Steven Rostedt 2 months ago
On Thu, 4 Dec 2025 07:35:44 +0000
Chen Jinghuang <chenjinghuang2@huawei.com> wrote:

> Hi Steven Rostedt,
> 
> This is a follow-up on my v3 patch for sched/rt: "Skip currently executing
> CPU in rto_next_cpu()" (archive link:
> https://lore.kernel.org/lkml/20251126055403.2076735-1-chenjinghuang2@huawei.com/)
> 
> You previously confirmed that this patch resolves the issue of a CPU
> sending an IPI to itself. This patch has also fully resolved the issue

I didn't confirm it resolved the issue. I just said it looks like it would.
I'm not the one that found the bug. I would assume the one that found this
issue tested this new patch and confirmed that it passed.

> I encountered in my testing environment. Could you please help merge this
> v3 patch into mainline? I'm happy to address any further review comments.
> 
> Thanks a lot for your time and review!

I already gave my reviewed-by tag. It's Peter Zijlstra that needs to accept
it. I'm only a reviewer and not one of the scheduling maintainers.

One issue you have here is that you are replying to the previous patch with
a new patch. That's not how it works. A new patch must be a start of a new
thread. Otherwise it gets very confusing for the maintainer, and most of
the time, maintainers miss these new patches embedded into threads of old
patches.

What you should also do is reference the lore link in your "changes"
portion after the '---'. For example, v3 should have been a new thread with
the following:

Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v2: https://lore.kernel.org/all/20251125083649.1814558-1-chenjinghuang2@huawei.com/

- Remove unneeded extra whitespace
- Add Reviewed-by tag from Steven Rostedt


And the patch of v2 should have had:

Changes since v1: https://lore.kernel.org/all/20251121014004.564508-1-chenjinghuang2@huawei.com/
- Replace the original "check NEED_RESCHED on target CPU"
  logic with "skip the currently executing CPU"
- This modification eliminates self-IPIs

And that allows people to find the previous version of the patch without
having the new version be a reply to it.

-- Steve
[PATCH v3] sched/rt: Skip currently executing CPU in rto_next_cpu()
Posted by Chen Jinghuang 2 months, 1 week ago
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution of
rto_push_irq_work_func(). During push_rt_task() on CPU0, if
next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED, and
after the push operation completes, CPU0 calls rto_next_cpu(). Since only
CPU0 is overloaded in this scenario, rto_next_cpu() should ideally return
-1 (no further IPI needed).

However, every CPU that invokes tell_cpu_to_push() during LB increments
rd->rto_loop_next. Even when rd->rto_cpu has been reset to -1, the
mismatch between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu()
to restart its search from -1. With CPU0 still overloaded (satisfying
rt_nr_migratory && rt_nr_total > 1), it gets reselected, so CPU0 queues
irq_work to itself and sends self-IPIs repeatedly. As long as CPU0 stays
overloaded and other CPUs keep running pull_rt_task(), it falls into an
infinite self-IPI loop, wasting CPU cycles on unnecessary interrupt
handling.

The triggering scenario is as follows:

         cpu0                        cpu1                        cpu2
                                 pull_rt_task
                               tell_cpu_to_push
                 <------------ irq_work_queue_on
rto_push_irq_work_func
    push_rt_task
    resched_curr(rq)                                         pull_rt_task
    rto_next_cpu                                           tell_cpu_to_push
                 <---------------------------------- atomic_inc(rto_loop_next)
rd->rto_loop != next
    rto_next_cpu
    irq_work_queue_on
rto_push_irq_work_func

Fix the redundant self-IPIs by skipping the currently executing CPU in
rto_next_cpu().

Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
v1-->v2:
- Replace the original "check NEED_RESCHED on target CPU"
  logic with "skip the currently executing CPU"
- This modification eliminates self-IPIs
v2-->v3:
- Remove unneeded extra whitespace
- Add Reviewed-by tag from Steven Rostedt
---
---
 kernel/sched/rt.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..f6af41e27508 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2100,6 +2100,7 @@ static void push_rt_tasks(struct rq *rq)
  */
 static int rto_next_cpu(struct root_domain *rd)
 {
+	int this_cpu = smp_processor_id();
 	int next;
 	int cpu;
 
@@ -2118,10 +2119,12 @@ static int rto_next_cpu(struct root_domain *rd)
 	 */
 	for (;;) {
 
-		/* When rto_cpu is -1 this acts like cpumask_first() */
-		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
-
-		rd->rto_cpu = cpu;
+		do {
+			/* When rto_cpu is -1 this acts like cpumask_first() */
+			cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+			rd->rto_cpu = cpu;
+			/* Do not send IPI to self */
+		} while (cpu == this_cpu);
 
 		if (cpu < nr_cpu_ids)
 			return cpu;
-- 
2.34.1