Selecting an idle CPU and marking it as busy without directly
dispatching a task can lead to scheduling inefficiencies, as the CPU
remains incorrectly marked as busy (even if it returns back to an idle
state), making it ineligible for selection in ops.select_cpu() and
similar operations.
This results in suboptimal core utilization; see [1] for an example.
This issue was introduced by the pick_next_task() refactoring, where the
in-kernel CPU idle state is now updated only during transitions between
non-idle and idle tasks. Previously, it was refreshed during every idle
task cycle as part of the put_prev_task() call.
To address this, update the idle state when a CPU is kicked from idle,
provided no task is queued to the local DSQ. This ensures the CPU is
correctly marked as idle when not running tasks, avoiding scheduling
bubbles and maintaining efficient core utilization.
[1] https://github.com/sched-ext/scx/pull/1139
Fixes: 7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 926579624c41..bdee66e7b353 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6079,6 +6079,19 @@ static bool can_skip_idle_kick(struct rq *rq)
return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
}
+static void refresh_idle_state_on_kick(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * If the CPU is idle and the local DSQ has no queued tasks, update
+ * its idle state to prevent the CPU from staying busy even if it
+ * returns back to idle without executing any task.
+ */
+ if (is_idle_task(rq->curr) && !rq->scx.local_dsq.nr)
+ __scx_update_idle(rq, true);
+}
+
static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
{
struct rq *rq = cpu_rq(cpu);
@@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
should_wait = true;
}
+ refresh_idle_state_on_kick(rq);
resched_curr(rq);
} else {
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
@@ -6123,8 +6137,10 @@ static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
raw_spin_rq_lock_irqsave(rq, flags);
if (!can_skip_idle_kick(rq) &&
- (cpu_online(cpu) || cpu == cpu_of(this_rq)))
+ (cpu_online(cpu) || cpu == cpu_of(this_rq))) {
+ refresh_idle_state_on_kick(rq);
resched_curr(rq);
+ }
raw_spin_rq_unlock_irqrestore(rq, flags);
}
--
2.47.1
Hi Andrea,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.13-rc6 next-20250107]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andrea-Righi/sched_ext-Refresh-idle-state-when-kicking-CPUs/20250102-022650
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250101182449.21517-1-arighi%40nvidia.com
patch subject: [PATCH] sched_ext: Refresh idle state when kicking CPUs
config: x86_64-buildonly-randconfig-005-20250108 (https://download.01.org/0day-ci/archive/20250108/202501080743.HLPZzVsv-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250108/202501080743.HLPZzVsv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501080743.HLPZzVsv-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/build_policy.c:63:
kernel/sched/ext.c: In function 'refresh_idle_state_on_kick':
>> kernel/sched/ext.c:6031:17: error: implicit declaration of function '__scx_update_idle'; did you mean 'scx_update_idle'? [-Werror=implicit-function-declaration]
6031 | __scx_update_idle(rq, true);
| ^~~~~~~~~~~~~~~~~
| scx_update_idle
cc1: some warnings being treated as errors
vim +6031 kernel/sched/ext.c
6020
6021 static void refresh_idle_state_on_kick(struct rq *rq)
6022 {
6023 lockdep_assert_rq_held(rq);
6024
6025 /*
6026 * If the CPU is idle and the local DSQ has no queued tasks, update
6027 * its idle state to prevent the CPU from staying busy even if it
6028 * returns back to idle without executing any task.
6029 */
6030 if (is_idle_task(rq->curr) && !rq->scx.local_dsq.nr)
> 6031 __scx_update_idle(rq, true);
6032 }
6033
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hello,

On Wed, Jan 01, 2025 at 07:24:49PM +0100, Andrea Righi wrote:
...
> @@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
> 		should_wait = true;
> 	}
>
> +	refresh_idle_state_on_kick(rq);
> 	resched_curr(rq);

I'm not sure this is quite correct. e.g. This can cause multiple
back-to-back busy->busy transitions and can incorrectly assert idle when
the CPU ends up running non-idle tasks afterwards.

When the put_prev/set_next paths were reorganized, we lost the signal on
the CPU re-entering idle from idle. However, that signal is still
available if we hook into idle_class->pick_task(), right? So, if we move
the update_idle(true) call there and make sure that we don't generate an
event on busy->busy transitions, we should be able to restore the
previous behavior?

Thanks.

--
tejun
On Thu, Jan 02, 2025 at 12:33:06PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 01, 2025 at 07:24:49PM +0100, Andrea Righi wrote:
> ...
> > @@ -6104,6 +6117,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
> > 		should_wait = true;
> > 	}
> >
> > +	refresh_idle_state_on_kick(rq);
> > 	resched_curr(rq);
>
> I'm not sure this is quite correct. e.g. This can cause multiple
> back-to-back busy->busy transitions and can incorrectly assert idle when the
> CPU ends up running non-idle tasks afterwards.

Right, we solve one issue, but it might result in some CPUs being marked
as idle even though they're actively running a task.

> When the put_prev/set_next paths were reorganized, we lost the signal on the
> CPU re-entering idle from idle. However, that signal is still available if
> we hook into idle_class->pick_task(), right? So, if we move
> update_idle(true) call there and make sure that we don't generate an event
> on busy->busy transitions, we should be able to restore the previous
> behavior?

Which is basically what I did here:
https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/

We didn't fully like this, because it'd introduce unbalanced transitions,
as update_idle(cpu, true) can be generated multiple times. But it's
probably fine, at the end we would just restore the original behavior and
it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
update_idle()" scenarios.

If we like this approach I can send a new patch updating the comment to
better clarify the scenarios that we are trying to solve. What do you
think?

Thanks,
-Andrea
Hello,

On Fri, Jan 03, 2025 at 09:55:14AM +0100, Andrea Righi wrote:
...
> > When the put_prev/set_next paths were reorganized, we lost the signal on the
> > CPU re-entering idle from idle. However, that signal is still available if
> > we hook into idle_class->pick_task(), right? So, if we move
> > update_idle(true) call there and make sure that we don't generate an event
> > on busy->busy transitions, we should be able to restore the previous
> > behavior?
>
> Which is basically what I did here:
> https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
>
> We didn't fully like this, because it'd introduce unbalanced transitions,
> as update_idle(cpu, true) can be generated multiple times. But it's
> probably fine, at the end we would just restore the original behavior and
> it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
> update_idle()" scenarios.
>
> If we like this approach I can send a new patch updating the comment to
> better clarify the scenarios that we are trying to solve. What do you
> think?

Maybe we can solve the unbalanced transitions by tracking per-cpu idle
state separately and invoking ops.update_idle() only on actual
transitions?

Thanks.

--
tejun
On Fri, Jan 03, 2025 at 11:39:51AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Jan 03, 2025 at 09:55:14AM +0100, Andrea Righi wrote:
> ...
> > > When the put_prev/set_next paths were reorganized, we lost the signal on the
> > > CPU re-entering idle from idle. However, that signal is still available if
> > > we hook into idle_class->pick_task(), right? So, if we move
> > > update_idle(true) call there and make sure that we don't generate an event
> > > on busy->busy transitions, we should be able to restore the previous
> > > behavior?
> >
> > Which is basically what I did here:
> > https://lore.kernel.org/lkml/20241015111539.12136-1-andrea.righi@linux.dev/
> >
> > We didn't fully like this, because it'd introduce unbalanced transitions,
> > as update_idle(cpu, true) can be generated multiple times. But it's
> > probably fine, at the end we would just restore the original behavior and
> > it'd allow to solve both the "pick_idle + kick CPU" and the "kick from
> > update_idle()" scenarios.
> >
> > If we like this approach I can send a new patch updating the comment to
> > better clarify the scenarios that we are trying to solve. What do you
> > think?
>
> Maybe we can solve the unbalanced transitions by tracking per-cpu idle state
> separately and invoking ops.update_idle() only on actual transitions?

We could call scx_update_idle() from pick_task_idle() to refresh the
idle cpumasks, but skip the call to ops.update_idle() to avoid
unbalanced transitions.

However, if a scheduler implements a custom idle tracking policy through
ops.update_idle() we might face a similar issue: the typical sequence
scx_bpf_pick_idle_cpu() + scx_bpf_kick_cpu() + CPU going back to idle
state without dispatching a task would leave the CPU marked as busy,
incorrectly.

The issue is that we call ops.update_idle() when a CPU enters or exits
SCHED_IDLE, whereas it should ideally be called when the CPU transitions
in/out of the idle state. So perhaps a kick from idle should trigger
ops.update_idle(cpu, false)? Still, I'm not sure if that would provide
any benefit... after all, do you see any practical scenarios where
having unbalanced transitions could be a problem?

Thanks,
-Andrea
Hello,

On Sat, Jan 04, 2025 at 08:06:26AM +0100, Andrea Righi wrote:
...
> The issue is that we call ops.update_idle() when a CPU enters or exits
> SCHED_IDLE, whereas it should ideally be called when the CPU transitions
> in/out of the idle state. So perhaps a kick from idle should trigger
> ops.update_idle(cpu, false)? Still, I'm not sure if that would provide any
> benefit... after all, do you see any practical scenarios where having
> unbalanced transitions could be a problem?

If possible, we should keep the calls balanced even if that means a bit
of complications on the core side as these things can become subtle bugs
from BPF scheduler side. And, yeah, I think kicking transitioning the
CPU out of idle makes sense to me.

Thanks.

--
tejun