[PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()

K Prateek Nayak posted 3 patches 1 year, 5 months ago
[PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
Posted by K Prateek Nayak 1 year, 5 months ago
The need_resched() check currently in nohz_csd_func() can be traced
back to scheduler_ipi(), where it was added in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance").

Since then, it has travelled quite a bit, but an idle_cpu() check alone
now appears sufficient to detect the need to bail out of an idle load
balance. To justify this removal, consider all the following cases where
an idle load balance could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
  on wakelist if wakee cpu is idle") a target perceived to be idle
  (target_rq->nr_running == 0) will return true for
  ttwu_queue_cond(target) which will offload the task wakeup to the idle
  target via an IPI.

  In all such cases target_rq->ttwu_pending will be set to 1 before
  queuing the wake function.

  If an idle load balance races here, the following scenarios are possible:

  - The CPU is not in TIF_POLLING_NRFLAG mode, in which case an actual
    IPI is sent to the CPU to wake it out of idle. If nohz_csd_func()
    is queued before sched_ttwu_pending(), the idle load balance will
    bail out because idle_cpu(target) returns 0 while
    target_rq->ttwu_pending is 1. If nohz_csd_func() is queued after
    sched_ttwu_pending(), it should see rq->nr_running to be non-zero
    and bail out of the idle load balance.

  - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
    the sender will simply set TIF_NEED_RESCHED for the target to put it
    out of idle and flush_smp_call_function_queue() in do_idle() will
    execute the call function. Depending on the ordering of the queuing
    of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
    nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
    target_rq->nr_running to be non-zero if there is a genuine task
    wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
  (target_rq->nr_running != 0) but the CPU is in fact going idle and due
  to a series of unfortunate events, the system reaches a case where the
  waker CPU decides to perform the wakeup by itself in ttwu_queue() on
  the target CPU but target is concurrently selected for idle load
  balance (Can this happen? I'm not sure, but we'll consider its
  possibility to estimate the worst case scenario).

  ttwu_do_activate() calls enqueue_task(), which increments
  "rq->nr_running", after which it calls wakeup_preempt(), which is
  responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
  setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
  thing to note in this case is that rq->nr_running is already non-zero
  at the time of a wakeup, before TIF_NEED_RESCHED is set, which leads
  to the idle_cpu() check returning false.

In all cases, it seems that the need_resched() check is unnecessary
when checking idle_cpu() first, since an impending wakeup racing with
the idle load balancer will either set "rq->ttwu_pending" or indicate a
newly woken task via "rq->nr_running".

Chasing the reason why this check might have existed in the first
place, I came across Peter's suggestion on the first iteration of
Suresh's patch from 2011 [1], where the condition to raise SCHED_SOFTIRQ
was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

However, since this was preceded by sched_ttwu_do_pending(), which is
the equivalent of sched_ttwu_pending() in the current upstream kernel,
the need_resched() check was necessary to catch a newly queued task.
Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced the "rq->idle == current"
check. However, even back then, the idle_cpu() check would have been
sufficient to catch the enqueue of a new task, and since commit
b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") overloads
the interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling,
remove the need_resched() check in nohz_csd_func() to raise
SCHED_SOFTIRQ based on Peter's suggestion.

Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..1e0c77eac65a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
 	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
 	rq->idle_balance = idle_cpu(cpu);
-	if (rq->idle_balance && !need_resched()) {
+	if (rq->idle_balance) {
 		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
-- 
2.34.1
Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
Posted by K Prateek Nayak 1 year, 4 months ago
(+ Thomas, Sebastian, Christoph)

Hello everyone,

Adding folks who were cc'd on
https://lore.kernel.org/all/20220413133024.356509586@linutronix.de/

On 7/10/2024 2:32 PM, K Prateek Nayak wrote:
> The need_resched() check currently in nohz_csd_func() can be tracked
> to have been added in scheduler_ipi() back in 2011 via commit
> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
> 
> Since then, it has travelled quite a bit but it seems like an idle_cpu()
> check currently is sufficient to detect the need to bail out from an
> idle load balancing. To justify this removal, consider all the following
> case where an idle load balancing could race with a task wakeup:
> 
> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
>    on wakelist if wakee cpu is idle") a target perceived to be idle
>    (target_rq->nr_running == 0) will return true for
>    ttwu_queue_cond(target) which will offload the task wakeup to the idle
>    target via an IPI.
> 
>    In all such cases target_rq->ttwu_pending will be set to 1 before
>    queuing the wake function.
> 
>    If an idle load balance races here, following scenarios are possible:
> 
>    - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
>      IPI is sent to the CPU to wake it out of idle. If the
>      nohz_csd_func() queues before sched_ttwu_pending(), the idle load
>      balance will bail out since idle_cpu(target) returns 0 since
>      target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
>      sched_ttwu_pending() it should see rq->nr_running to be non-zero and
>      bail out of idle load balancing.
> 
>    - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
>      the sender will simply set TIF_NEED_RESCHED for the target to put it
>      out of idle and flush_smp_call_function_queue() in do_idle() will
>      execute the call function. Depending on the ordering of the queuing
>      of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
>      nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
>      target_rq->nr_running to be non-zero if there is a genuine task
>      wakeup racing with the idle load balance kick.
> 
> o The waker CPU perceives the target CPU to be busy
>    (target_rq->nr_running != 0) but the CPU is in fact going idle and due
>    to a series of unfortunate events, the system reaches a case where the
>    waker CPU decides to perform the wakeup by itself in ttwu_queue() on
>    the target CPU but target is concurrently selected for idle load
>    balance (Can this happen? I'm not sure, but we'll consider its
>    possibility to estimate the worst case scenario).
> 
>    ttwu_do_activate() calls enqueue_task() which would increment
>    "rq->nr_running" post which it calls wakeup_preempt() which is
>    responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
>    setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
>    thing to note in this case is that rq->nr_running is already non-zero
>    in case of a wakeup before TIF_NEED_RESCHED is set which would
>    lead to idle_cpu() check returning false.
> 
> In all cases, it seems that need_resched() check is unnecessary when
> checking for idle_cpu() first since an impending wakeup racing with idle
> load balancer will either set the "rq->ttwu_pending" or indicate a newly
> woken task via "rq->nr_running".
> 
> Chasing the reason why this check might have existed in the first place,
> I came across Peter's suggestion on the first iteration of Suresh's
> patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
> 
> 	sched_ttwu_do_pending(list);
> 
> 	if (unlikely((rq->idle == current) &&
> 	    rq->nohz_balance_kick &&
> 	    !need_resched()))
> 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> 
> However, since this was preceded by sched_ttwu_do_pending() which is
> equivalent of sched_ttwu_pending() in the current upstream kernel, the
> need_resched() check was necessary to catch a newly queued task. Peter
> suggested modifying it to:
> 
> 	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
> 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> 
> where idle_cpu() seems to have replaced "rq->idle == current" check.
> However, even back then, the idle_cpu() check would have been sufficient
> to have caught the enqueue of a new task and since commit b2a02fc43a1f
> ("smp: Optimize send_call_function_single_ipi()") overloads the
> interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove
> the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
> on Peter's suggestion.
> 
> Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
> Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
> Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")

Turns out the above commit, together with commit 1a90bfd22020 ("smp:
Make softirq handling RT safe in flush_smp_call_function_queue()"),
will trigger the WARN_ON_ONCE() in do_softirq_post_smp_call_flush() on
RT kernels after this change, since nohz_csd_func() will now raise
SCHED_SOFTIRQ to trigger the idle balance and is executed from
flush_smp_call_function_queue() in do_idle().

I noticed the following splat early into the boot during my testing
of the series:

     ------------[ cut here ]------------
     WARNING: CPU: 4 PID: 0 at kernel/softirq.c:326 do_softirq_post_smp_call_flush+0x1a/0x40
     Modules linked in:
     CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.10.0-rc6-rt11-test-rt+ #1160
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     RIP: 0010:do_softirq_post_smp_call_flush+0x1a/0x40
     Code: ...
     RSP: 0018:ffffb3ae003a7eb8 EFLAGS: 00010002
     RAX: 0000000000000080 RBX: 0000000000000282 RCX: 0000000000000007
     RDX: 0000000000000000 RSI: ffff9fc3fb4492e0 RDI: 0000000000000000
     RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
     R10: 000000000000009b R11: ffff9f8586e2d4d0 R12: 0000000000000000
     R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     FS:  0000000000000000(0000) GS:ffff9fc3fb400000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 0000000000000000 CR3: 000000807d470001 CR4: 0000000000770ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      ? __warn+0x88/0x180
      ? do_softirq_post_smp_call_flush+0x1a/0x40
      ? report_bug+0x18e/0x1a0
      ? handle_bug+0x42/0x70
      ? exc_invalid_op+0x18/0x70
      ? asm_exc_invalid_op+0x1a/0x20
      ? do_softirq_post_smp_call_flush+0x1a/0x40
      ? srso_alias_return_thunk+0x5/0xfbef5
      flush_smp_call_function_queue+0x7a/0x90
      do_idle+0x15f/0x270
      cpu_startup_entry+0x29/0x30
      start_secondary+0x12b/0x160
      common_startup_64+0x13e/0x141
      </TASK>
     ---[ end trace 0000000000000000 ]---

which points to:

     WARN_ON_ONCE(was_pending != local_softirq_pending())

Since MWAIT-based idling on x86 sets TIF_POLLING_NRFLAG, IPIs to an
idle CPU are optimized out by commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") and the logic instead relies on
flush_smp_call_function_queue() in the idle exit path to execute the
SMP call function. This previously went undetected since the sender of
the IPI sets the TIF_NEED_RESCHED bit, which would have tripped the
need_resched() check in nohz_csd_func() and prevented it from raising
the softirq.

Would it be okay to allow raising a SCHED_SOFTIRQ from
flush_smp_call_function_queue() on PREEMPT_RT kernels? Something like:

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 8308687fc7b9..d8ce76e6e318 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -314,17 +314,24 @@ static inline void invoke_softirq(void)
  		wakeup_softirqd();
  }
  
+#define SCHED_SOFTIRQ_MASK	BIT(SCHED_SOFTIRQ)
+
  /*
   * flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
+ * call. On RT kernels this is undesired and the only known functionalities
+ * are in the block layer which is disabled on RT, and in the scheduler for
+ * idle load balancing. If soft interrupts get raised which haven't been
+ * raised before the flush, warn if it is not a SCHED_SOFTIRQ so it can be
   * investigated.
   */
  void do_softirq_post_smp_call_flush(unsigned int was_pending)
  {
-	if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
+	unsigned int is_pending = local_softirq_pending();
+
+	if (unlikely(was_pending != is_pending)) {
+		WARN_ON_ONCE(was_pending != (is_pending & ~SCHED_SOFTIRQ_MASK));
  		invoke_softirq();
+	}
  }
  
  #else /* CONFIG_PREEMPT_RT */
--

With the above diff, I do not see the splat I was seeing initially. If
there are no strong objections, I can fold in the above diff in v2.
-- 
Thanks and Regards,
Prateek

> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..1e0c77eac65a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
>   	WARN_ON(!(flags & NOHZ_KICK_MASK));
>   
>   	rq->idle_balance = idle_cpu(cpu);
> -	if (rq->idle_balance && !need_resched()) {
> +	if (rq->idle_balance) {
>   		rq->nohz_idle_balance = flags;
>   		raise_softirq_irqoff(SCHED_SOFTIRQ);
>   	}
Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
Posted by Peter Zijlstra 1 year, 5 months ago
On Wed, Jul 10, 2024 at 09:02:08AM +0000, K Prateek Nayak wrote:
> The need_resched() check currently in nohz_csd_func() can be tracked
> to have been added in scheduler_ipi() back in 2011 via commit
> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
> 
> Since then, it has travelled quite a bit but it seems like an idle_cpu()
> check currently is sufficient to detect the need to bail out from an
> idle load balancing. To justify this removal, consider all the following
> case where an idle load balancing could race with a task wakeup:
> 
> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
>   on wakelist if wakee cpu is idle") a target perceived to be idle
>   (target_rq->nr_running == 0) will return true for
>   ttwu_queue_cond(target) which will offload the task wakeup to the idle
>   target via an IPI.
> 
>   In all such cases target_rq->ttwu_pending will be set to 1 before
>   queuing the wake function.
> 
>   If an idle load balance races here, following scenarios are possible:
> 
>   - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
>     IPI is sent to the CPU to wake it out of idle. If the
>     nohz_csd_func() queues before sched_ttwu_pending(), the idle load
>     balance will bail out since idle_cpu(target) returns 0 since
>     target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
>     sched_ttwu_pending() it should see rq->nr_running to be non-zero and
>     bail out of idle load balancing.
> 
>   - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
>     the sender will simply set TIF_NEED_RESCHED for the target to put it
>     out of idle and flush_smp_call_function_queue() in do_idle() will
>     execute the call function. Depending on the ordering of the queuing
>     of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
>     nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
>     target_rq->nr_running to be non-zero if there is a genuine task
>     wakeup racing with the idle load balance kick.

For completeness' sake, we should also consider the !TTWU_QUEUE case;
this configuration is the default for PREEMPT_RT, where the wake_list
is a source of non-determinism.

In quick reading I think that case should be fine, since we directly
enqueue remotely and ->nr_running adjusts accordingly, but it is late in
the day and I'm easily mistaken.

> o The waker CPU perceives the target CPU to be busy
>   (target_rq->nr_running != 0) but the CPU is in fact going idle and due
>   to a series of unfortunate events, the system reaches a case where the
>   waker CPU decides to perform the wakeup by itself in ttwu_queue() on
>   the target CPU but target is concurrently selected for idle load
>   balance (Can this happen? I'm not sure, but we'll consider its
>   possibility to estimate the worst case scenario).
> 
>   ttwu_do_activate() calls enqueue_task() which would increment
>   "rq->nr_running" post which it calls wakeup_preempt() which is
>   responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
>   setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
>   thing to note in this case is that rq->nr_running is already non-zero
>   in case of a wakeup before TIF_NEED_RESCHED is set which would
>   lead to idle_cpu() check returning false.
> 
> In all cases, it seems that need_resched() check is unnecessary when
> checking for idle_cpu() first since an impending wakeup racing with idle
> load balancer will either set the "rq->ttwu_pending" or indicate a newly
> woken task via "rq->nr_running".

Right.

> Chasing the reason why this check might have existed in the first place,
> I came across Peter's suggestion on the first iteration of Suresh's
> patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
> 
> 	sched_ttwu_do_pending(list);
> 
> 	if (unlikely((rq->idle == current) &&
> 	    rq->nohz_balance_kick &&
> 	    !need_resched()))
> 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> 
> However, since this was preceded by sched_ttwu_do_pending() which is
> equivalent of sched_ttwu_pending() in the current upstream kernel, the
> need_resched() check was necessary to catch a newly queued task. Peter
> suggested modifying it to:
> 
> 	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
> 		raise_softirq_irqoff(SCHED_SOFTIRQ);
> 
> where idle_cpu() seems to have replaced "rq->idle == current" check.
> However, even back then, the idle_cpu() check would have been sufficient
> to have caught the enqueue of a new task and since commit b2a02fc43a1f
> ("smp: Optimize send_call_function_single_ipi()") overloads the
> interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove
> the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
> on Peter's suggestion.

... sooo many years ago :-)

> Link: https://lore.kernel.org/all/1317670590.20367.38.camel@twins/ [1]
> Link: https://lore.kernel.org/lkml/20240615014521.GR8774@noisy.programming.kicks-ass.net/
> Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..1e0c77eac65a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1205,7 +1205,7 @@ static void nohz_csd_func(void *info)
>  	WARN_ON(!(flags & NOHZ_KICK_MASK));
>  
>  	rq->idle_balance = idle_cpu(cpu);
> -	if (rq->idle_balance && !need_resched()) {
> +	if (rq->idle_balance) {
>  		rq->nohz_idle_balance = flags;
>  		raise_softirq_irqoff(SCHED_SOFTIRQ);
>  	}
> -- 
> 2.34.1
>
Re: [PATCH 1/3] sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
Posted by K Prateek Nayak 1 year, 5 months ago
Hello Peter,

On 7/10/2024 8:23 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2024 at 09:02:08AM +0000, K Prateek Nayak wrote:
>> The need_resched() check currently in nohz_csd_func() can be tracked
>> to have been added in scheduler_ipi() back in 2011 via commit
>> ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")
>>
>> Since then, it has travelled quite a bit but it seems like an idle_cpu()
>> check currently is sufficient to detect the need to bail out from an
>> idle load balancing. To justify this removal, consider all the following
>> case where an idle load balancing could race with a task wakeup:
>>
>> o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
>>    on wakelist if wakee cpu is idle") a target perceived to be idle
>>    (target_rq->nr_running == 0) will return true for
>>    ttwu_queue_cond(target) which will offload the task wakeup to the idle
>>    target via an IPI.
>>
>>    In all such cases target_rq->ttwu_pending will be set to 1 before
>>    queuing the wake function.
>>
>>    If an idle load balance races here, following scenarios are possible:
>>
>>    - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
>>      IPI is sent to the CPU to wake it out of idle. If the
>>      nohz_csd_func() queues before sched_ttwu_pending(), the idle load
>>      balance will bail out since idle_cpu(target) returns 0 since
>>      target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
>>      sched_ttwu_pending() it should see rq->nr_running to be non-zero and
>>      bail out of idle load balancing.
>>
>>    - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
>>      the sender will simply set TIF_NEED_RESCHED for the target to put it
>>      out of idle and flush_smp_call_function_queue() in do_idle() will
>>      execute the call function. Depending on the ordering of the queuing
>>      of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
>>      nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
>>      target_rq->nr_running to be non-zero if there is a genuine task
>>      wakeup racing with the idle load balance kick.
> 
> For completeness' sake, we should also consider the !TTWU_QUEUE case;
> this configuration is the default for PREEMPT_RT, where the wake_list
> is a source of non-determinism.
> 
> In quick reading I think that case should be fine, since we directly
> enqueue remotely and ->nr_running adjusts accordingly, but it is late in
> the day and I'm easily mistaken.

 From what I've seen, an enqueue will always update "rq->nr_running"
before setting the "NEED_RESCHED" flag, but I'll go confirm that again
and report back in case that turns out to be false.

> 
>> o The waker CPU perceives the target CPU to be busy
>>    (target_rq->nr_running != 0) but the CPU is in fact going idle and due
>>    to a series of unfortunate events, the system reaches a case where the
>>    waker CPU decides to perform the wakeup by itself in ttwu_queue() on
>>    the target CPU but target is concurrently selected for idle load
>>    balance (Can this happen? I'm not sure, but we'll consider its
>>    possibility to estimate the worst case scenario).
>>
>>    ttwu_do_activate() calls enqueue_task() which would increment
>>    "rq->nr_running" post which it calls wakeup_preempt() which is
>>    responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
>>    setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
>>    thing to note in this case is that rq->nr_running is already non-zero
>>    in case of a wakeup before TIF_NEED_RESCHED is set which would
>>    lead to idle_cpu() check returning false.
>>
>> In all cases, it seems that need_resched() check is unnecessary when
>> checking for idle_cpu() first since an impending wakeup racing with idle
>> load balancer will either set the "rq->ttwu_pending" or indicate a newly
>> woken task via "rq->nr_running".
> 
> Right.
> 
>> [..snip..]

-- 
Thanks and Regards,
Prateek