[PATCH] sched_ext: Refresh idle state when kicking CPUs

Andrea Righi posted 1 patch 1 year, 1 month ago
kernel/sched/ext.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
[PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Andrea Righi 1 year, 1 month ago
Selecting an idle CPU and marking it as busy without directly
dispatching a task can lead to scheduling inefficiencies, as the CPU
remains incorrectly marked as busy (even if it returns back to an idle
state), making it ineligible for selection in ops.select_cpu() and
similar operations.

This results in suboptimal core utilization, see for example [1]. This
issue was introduced by the pick_next_task() refactoring, where the
in-kernel CPU idle state is now updated only during transitions between
non-idle and idle tasks. Previously, it was refreshed during every idle
task cycle as part of the put_prev_task() call.

To address this, update the idle state when a CPU is kicked from idle,
provided no task is queued to the local DSQ. This ensures the CPU is
correctly marked as idle when not running tasks, avoiding scheduling
bubbles and maintaining efficient core utilization.

[1] https://github.com/sched-ext/scx/pull/1139

Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 926579624c41..bdee66e7b353 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6079,6 +6079,19 @@ static bool can_skip_idle_kick(struct rq *rq)
 	return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
 }
 
+static void refresh_idle_state_on_kick(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * If the CPU is idle and the local DSQ has no queued tasks, update
+	 * its idle state to prevent the CPU from staying busy even if it
+	 * returns back to idle without executing any task.
+	 */
+	if (is_idle_task(rq->curr) && !rq->scx.local_dsq.nr)
+		__scx_update_idle(rq, true);
+}
+
 static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
 			should_wait = true;
 		}
 
+		refresh_idle_state_on_kick(rq);
 		resched_curr(rq);
 	} else {
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
@@ -6123,8 +6137,10 @@ static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
 	raw_spin_rq_lock_irqsave(rq, flags);
 
 	if (!can_skip_idle_kick(rq) &&
-	    (cpu_online(cpu) || cpu == cpu_of(this_rq)))
+	    (cpu_online(cpu) || cpu == cpu_of(this_rq))) {
+		refresh_idle_state_on_kick(rq);
 		resched_curr(rq);
+	}
 
 	raw_spin_rq_unlock_irqrestore(rq, flags);
 }
-- 
2.47.1
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by kernel test robot 1 year, 1 month ago
Hi Andrea,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.13-rc6 next-20250107]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrea-Righi/sched_ext-Refresh-idle-state-when-kicking-CPUs/20250102-022650
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20250101182449.21517-1-arighi%40nvidia.com
patch subject: [PATCH] sched_ext: Refresh idle state when kicking CPUs
config: x86_64-buildonly-randconfig-005-20250108 (https://download.01.org/0day-ci/archive/20250108/202501080743.HLPZzVsv-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501080743.HLPZzVsv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501080743.HLPZzVsv-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/build_policy.c:63:
   kernel/sched/ext.c: In function 'refresh_idle_state_on_kick':
>> kernel/sched/ext.c:6031:17: error: implicit declaration of function '__scx_update_idle'; did you mean 'scx_update_idle'? [-Werror=implicit-function-declaration]
    6031 |                 __scx_update_idle(rq, true);
         |                 ^~~~~~~~~~~~~~~~~
         |                 scx_update_idle
   cc1: some warnings being treated as errors


vim +6031 kernel/sched/ext.c

  6020	
  6021	static void refresh_idle_state_on_kick(struct rq *rq)
  6022	{
  6023		lockdep_assert_rq_held(rq);
  6024	
  6025		/*
  6026		 * If the CPU is idle and the local DSQ has no queued tasks, update
  6027		 * its idle state to prevent the CPU from staying busy even if it
  6028		 * returns back to idle without executing any task.
  6029		 */
  6030		if (is_idle_task(rq->curr) && !rq->scx.local_dsq.nr)
> 6031			__scx_update_idle(rq, true);
  6032	}
  6033	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Tejun Heo 1 year, 1 month ago
Hello,

On Wed, Jan 01, 2025 at 07:24:49PM +0100, Andrea Righi wrote:
...
> @@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
>  			should_wait = true;
>  		}
>  
> +		refresh_idle_state_on_kick(rq);
>  		resched_curr(rq);

I'm not sure this is quite correct. e.g. This can cause multiple
back-to-back busy->busy transitions and can incorrectly assert idle when the
CPU ends up running non-idle tasks afterwards.

When the put_prev/set_next paths were reorganized, we lost the signal on the
CPU re-entering idle from idle. However, that signal is still available if
we hook into idle_class->pick_task(), right? So, if we move
update_idle(true) call there and make sure that we don't generate an event
on busy->busy transitions, we should be able to restore the previous
behavior?

Thanks.

-- 
tejun
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Andrea Righi 1 year, 1 month ago
On Thu, Jan 02, 2025 at 12:33:06PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 01, 2025 at 07:24:49PM +0100, Andrea Righi wrote:
> ...
> > @@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
> >  			should_wait = true;
> >  		}
> >  
> > +		refresh_idle_state_on_kick(rq);
> >  		resched_curr(rq);
> 
> I'm not sure this is quite correct. e.g. This can cause multiple
> back-to-back busy->busy transitions and can incorrectly assert idle when the
> CPU ends up running non-idle tasks afterwards.

Right, we solve one issue, but it might result in some CPUs being marked as
idle even though they're actively running a task.

> 
> When the put_prev/set_next paths were reorganized, we lost the signal on the
> CPU re-entering idle from idle. However, that signal is still available if
> we hook into idle_class->pick_task(), right? So, if we move
> update_idle(true) call there and make sure that we don't generate an event
> on busy->busy transitions, we should be able to restore the previous
> behavior?

Which is basically what I did here:
https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/

We didn't fully like this, because it'd introduce unbalanced transitions,
as update_idle(cpu, true) can be generated multiple times. But it's
probably fine, at the end we would just restore the original behavior and
it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
update_idle()" scenarios.

If we like this approach I can send a new patch updating the comment to
better clarify the scenarios that we are trying to solve. What do you
think?

Thanks,
-Andrea
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Tejun Heo 1 year, 1 month ago
Hello,

On Fri, Jan 03, 2025 at 09:55:14AM +0100, Andrea Righi wrote:
...
> > When the put_prev/set_next paths were reorganized, we lost the signal on the
> > CPU re-entering idle from idle. However, that signal is still available if
> > we hook into idle_class->pick_task(), right? So, if we move
> > update_idle(true) call there and make sure that we don't generate an event
> > on busy->busy transitions, we should be able to restore the previous
> > behavior?
> 
> Which is basically what I did here:
> https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
> 
> We didn't fully like this, because it'd introduce unbalanced transitions,
> as update_idle(cpu, true) can be generated multiple times. But it's
> probably fine, at the end we would just restore the original behavior and
> it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
> update_idle()" scenarios.
> 
> If we like this approach I can send a new patch updating the comment to
> better clarify the scenarios that we are trying to solve. What do you
> think?

Maybe we can solve the unbalanced transitions by tracking per-cpu idle state
separately and invoking ops.update_idle() only on actual transitions?

Thanks.

-- 
tejun
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Andrea Righi 1 year, 1 month ago
On Fri, Jan 03, 2025 at 11:39:51AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 03, 2025 at 09:55:14AM +0100, Andrea Righi wrote:
> ...
> > > When the put_prev/set_next paths were reorganized, we lost the signal on the
> > > CPU re-entering idle from idle. However, that signal is still available if
> > > we hook into idle_class->pick_task(), right? So, if we move
> > > update_idle(true) call there and make sure that we don't generate an event
> > > on busy->busy transitions, we should be able to restore the previous
> > > behavior?
> > 
> > Which is basically what I did here:
> > https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
> > 
> > We didn't fully like this, because it'd introduce unbalanced transitions,
> > as update_idle(cpu, true) can be generated multiple times. But it's
> > probably fine, at the end we would just restore the original behavior and
> > it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
> > update_idle()" scenarios.
> > 
> > If we like this approach I can send a new patch updating the comment to
> > better clarify the scenarios that we are trying to solve. What do you
> > think?
> 
> Maybe we can solve the unbalanced transitions by tracking per-cpu idle state
> separately and invoking ops.update_idle() only on actual transitions?

We could call scx_update_idle() from pick_task_idle() to refresh the idle
cpumasks, but skip the call to ops.update_idle() to avoid unbalanced
transitions.

However, if a scheduler implements a custom idle tracking policy through
ops.update_idle() we might face a similar issue: the typical sequence
scx_bpf_pick_idle_cpu() + scx_bpf_kick_cpu() + CPU going back to idle state
without dispatching a task would leave the CPU marked as busy, incorrectly.

The issue is that we call ops.update_idle() when a CPU enters or exits
SCHED_IDLE, whereas it should ideally be called when the CPU transitions
in/out of the idle state. So perhaps a kick from idle should trigger
ops.update_idle(cpu, false)? Still, I'm not sure if that would provide any
benefit... after all, do you see any practical scenarios where having
unbalanced transitions could be a problem?

Thanks,
-Andrea
Re: [PATCH] sched_ext: Refresh idle state when kicking CPUs
Posted by Tejun Heo 1 year, 1 month ago
Hello,

On Sat, Jan 04, 2025 at 08:06:26AM +0100, Andrea Righi wrote:
...
> The issue is that we call ops.update_idle() when a CPU enters or exits
> SCHED_IDLE, whereas it should ideally be called when the CPU transitions
> in/out of the idle state. So perhaps a kick from idle should trigger
> ops.update_idle(cpu, false)? Still, I'm not sure if that would provide any
> benefit... after all, do you see any practical scenarios where having
> unbalanced transitions could be a problem?

If possible, we should keep the calls balanced even if that means a bit of
complications on the core side as these things can become subtle bugs from
BPF scheduler side. And, yeah, I think kicking transitioning the CPU out of
idle makes sense to me.

Thanks.

-- 
tejun