[PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context

Posted by Andrea Righi 9 months ago
Allow scx_bpf_select_cpu_and() to be used from any context, not just
from ops.enqueue() or ops.select_cpu().

This enables schedulers, including user-space ones, to implement a
consistent idle CPU selection policy and helps reduce code duplication.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext_idle.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 6915685cd3d6b..5373132db02b6 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -943,10 +943,15 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
 	if (!check_builtin_idle_enabled())
 		return -EBUSY;
 
-	if (!scx_kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE))
-		return -EPERM;
-
 #ifdef CONFIG_SMP
+	/*
+	 * If called from an unlocked context, try to acquire
+	 * cpus_read_lock() to avoid races with CPU hotplug.
+	 */
+	if (scx_kf_allowed_if_unlocked())
+		if (!cpus_read_trylock())
+			return -EBUSY;
+
 	/*
 	 * This may also be called from ops.enqueue(), so we need to handle
 	 * per-CPU tasks as well. For these tasks, we can skip all idle CPU
@@ -956,10 +961,16 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
 	if (p->nr_cpus_allowed == 1) {
 		if (cpumask_test_cpu(prev_cpu, cpus_allowed) &&
 		    scx_idle_test_and_clear_cpu(prev_cpu))
-			return prev_cpu;
-		return -EBUSY;
+			cpu = prev_cpu;
+		else
+			cpu = -EBUSY;
+		goto out_unlock;
 	}
 	cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
+
+out_unlock:
+	if (scx_kf_allowed_if_unlocked())
+		cpus_read_unlock();
 #else
 	cpu = -EBUSY;
 #endif
@@ -1266,6 +1277,7 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_idle)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
@@ -1275,7 +1287,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
 
 BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
 BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
-- 
2.49.0
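
As an example of the kind of usage this enables, a user-space scheduler
backend (such as scx_rustland_core, discussed later in the thread) can query
the built-in idle CPU selection through a SEC("syscall") program run via
BPF_PROG_RUN. The sketch below is illustrative only: the argument struct,
program name and error convention are made up, and the kfunc declarations are
assumed to come from the scx common.bpf.h helpers.

	#include <scx/common.bpf.h>

	char _license[] SEC("license") = "GPL";

	/* Hypothetical input/output layout passed as ctx_in from user space. */
	struct select_cpu_args {
		s32 pid;
		s32 prev_cpu;
		s32 cpu;	/* out: selected CPU or negative error */
	};

	SEC("syscall")
	int select_idle_cpu(struct select_cpu_args *args)
	{
		struct task_struct *p;

		p = bpf_task_from_pid(args->pid);
		if (!p) {
			args->cpu = -1;	/* task not found */
			return 0;
		}

		/* Pick an idle CPU among the task's currently allowed CPUs. */
		args->cpu = scx_bpf_select_cpu_and(p, args->prev_cpu, 0,
						   p->cpus_ptr, 0);
		bpf_task_release(p);
		return 0;
	}
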
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Tejun Heo 9 months ago
Hello,

On Mon, May 12, 2025 at 05:14:56PM +0200, Andrea Righi wrote:
>  #ifdef CONFIG_SMP
> +	/*
> +	 * If called from an unlocked context, try to acquire
> +	 * cpus_read_lock() to avoid races with CPU hotplug.
> +	 */
> +	if (scx_kf_allowed_if_unlocked())
> +		if (!cpus_read_trylock())
> +			return -EBUSY;

Is this meaningful? The idle CPU selection is already racy against CPU
hotplug and we depend on the scheduler core to fix it up afterwards. Even
if scx_bpf_select_cpu_and() itself is not racy, it will drop the cpus lock
before returning and become racy again right there, i.e. this doesn't add
any meaningful protection.

Thanks.

-- 
tejun
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Andrea Righi 9 months ago
Hi Tejun,

On Mon, May 12, 2025 at 07:19:53AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, May 12, 2025 at 05:14:56PM +0200, Andrea Righi wrote:
> >  #ifdef CONFIG_SMP
> > +	/*
> > +	 * If called from an unlocked context, try to acquire
> > +	 * cpus_read_lock() to avoid races with CPU hotplug.
> > +	 */
> > +	if (scx_kf_allowed_if_unlocked())
> > +		if (!cpus_read_trylock())
> > +			return -EBUSY;
> 
> Is this meaningful? The idle CPU selection is already racy against CPU
> hotplugs and we depend on the scheduler core to fix it up afterwards. Even
> if scx_bpf_select_cpu_and() is not racy, it will drop the cpus lock before
> returning and becomes racy again right there. ie. This doesn't add any
> meaningful protection.

I was concerned that accessing llc_span() / llc_weight() from
scx_select_cpu_dfl() might be problematic if a CPU goes offline underneath,
but since we access them with rcu_read_lock() held, we probably don't need
cpus_read_lock() protection.

As for the scheduler picking a bad CPU, the sched_ext core can fix that up
afterwards, so it's not a problem.

I'll think more about the CPU hotplugging scenario.

Thanks,
-Andrea

> 
> Thanks.
> 
> -- 
> tejun
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by David Vernet 9 months ago
On Mon, May 12, 2025 at 05:14:56PM +0200, Andrea Righi wrote:
> Allow scx_bpf_select_cpu_and() to be used from any context, not just
> from ops.enqueue() or ops.select_cpu().
> 
> This enables schedulers, including user-space ones, to implement a
> consistent idle CPU selection policy and helps reduce code duplication.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Hi Andrea,

> ---
>  kernel/sched/ext_idle.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 6915685cd3d6b..5373132db02b6 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -943,10 +943,15 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64

FYI it looks like there are a few comments sprinkled around here that specify
that only the .select_cpu() and .enqueue() ops are allowed.

>  	if (!check_builtin_idle_enabled())
>  		return -EBUSY;
>  
> -	if (!scx_kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE))
> -		return -EPERM;
> -
>  #ifdef CONFIG_SMP
> +	/*
> +	 * If called from an unlocked context, try to acquire
> +	 * cpus_read_lock() to avoid races with CPU hotplug.
> +	 */
> +	if (scx_kf_allowed_if_unlocked())
> +		if (!cpus_read_trylock())
> +			return -EBUSY;

Hmm, so this now assumes that this function can be called without the rq lock
held. I'm not sure if that's safe, as scx_select_cpu_dfl() queries p->cpus_ptr,
which is protected by the rq lock. Additionally, scx_bpf_select_cpu_and()
checks p->nr_cpus_allowed, which is protected sometimes by pi_lock, sometimes by
the rq lock, etc., depending on the task's state.

This might be fine in practice as I _think_ the only unsafe thing would be if
p->cpus_ptr could have a torn read or write. scx_select_cpu_dfl() does do
preempt_disable() so no issues in querying the per-cpu variables there.

As long as this is safe (or can be made safe) I don't have a strong opinion one
way or the other. I think it's probably a good idea to allow users to remove
code duplication, and in general it's up to the user to use this API correctly
(e.g. if you're preempted during a call to scx_bpf_select_cpu_and() and
prev_cpu changes out from under you due to the task being migrated, that's
likely just a bug in the scheduler).

Thanks,
David

>  	/*
>  	 * This may also be called from ops.enqueue(), so we need to handle
>  	 * per-CPU tasks as well. For these tasks, we can skip all idle CPU
> @@ -956,10 +961,16 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
>  	if (p->nr_cpus_allowed == 1) {
>  		if (cpumask_test_cpu(prev_cpu, cpus_allowed) &&
>  		    scx_idle_test_and_clear_cpu(prev_cpu))
> -			return prev_cpu;
> -		return -EBUSY;
> +			cpu = prev_cpu;
> +		else
> +			cpu = -EBUSY;
> +		goto out_unlock;
>  	}
>  	cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
> +
> +out_unlock:
> +	if (scx_kf_allowed_if_unlocked())
> +		cpus_read_unlock();
>  #else
>  	cpu = -EBUSY;
>  #endif
> @@ -1266,6 +1277,7 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU)
>  BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
>  BTF_KFUNCS_END(scx_kfunc_ids_idle)
>  
>  static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> @@ -1275,7 +1287,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
>  
>  BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
>  BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> -BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
>  BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
>  
>  static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> -- 
> 2.49.0
> 
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Andrea Righi 9 months ago
Hi David,

On Mon, May 12, 2025 at 11:58:29AM -0500, David Vernet wrote:
> On Mon, May 12, 2025 at 05:14:56PM +0200, Andrea Righi wrote:
> > Allow scx_bpf_select_cpu_and() to be used from any context, not just
> > from ops.enqueue() or ops.select_cpu().
> > 
> > This enables schedulers, including user-space ones, to implement a
> > consistent idle CPU selection policy and helps reduce code duplication.
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> 
> Hi Andrea,
> 
> > ---
> >  kernel/sched/ext_idle.c | 23 +++++++++++++++++------
> >  1 file changed, 17 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> > index 6915685cd3d6b..5373132db02b6 100644
> > --- a/kernel/sched/ext_idle.c
> > +++ b/kernel/sched/ext_idle.c
> > @@ -943,10 +943,15 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
> 
> FYI it looks like there are a few comments sprinkled around here that specify
> that only the .select_cpu() and .enqueue() ops are allowed.

Ah, good catch! I'll fix the comments.

> 
> >  	if (!check_builtin_idle_enabled())
> >  		return -EBUSY;
> >  
> > -	if (!scx_kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE))
> > -		return -EPERM;
> > -
> >  #ifdef CONFIG_SMP
> > +	/*
> > +	 * If called from an unlocked context, try to acquire
> > +	 * cpus_read_lock() to avoid races with CPU hotplug.
> > +	 */
> > +	if (scx_kf_allowed_if_unlocked())
> > +		if (!cpus_read_trylock())
> > +			return -EBUSY;
> 
> Hmm, so this now assumes that this function can be called without the rq lock
> held. I'm not sure if that's safe, as scx_select_cpu_dfl() queries p->cpus_ptr,
> which is protected by the rq lock. Additionally, scx_bpf_select_cpu_and()
> checks p->nr_cpus_allowed which is protected sometimes by pi_lock, sometimes by
> the rq lock, etc depending on its state.

Yeah, it makes locking quite tricky. Maybe we can acquire the rq lock if
called from an unlocked context, instead of cpus_read_lock(), but then we
still have to deal with pi_lock.

> 
> This might be fine in practice as I _think_ the only unsafe thing would be if
> p->cpus_ptr could have a torn read or write. scx_select_cpu_dfl() does do
> preempt_disable() so no issues in querying the per-cpu variables there.

I've been running `stress-ng --race-sched N` for a while to stress-test
affinity changes and I haven't triggered any errors, but that doesn't mean
the code is safe...

> 
> As long as this is safe (or can be made safe) I don't have a strong opinion one
> way or the other. I think it's probably a good idea to allow users to remove
> code duplication, and in general it's up to the user to use this API correctly
> (e.g. if you're preempted during a call to scx_bpf_select_cpu_and() and
> prev_cpu changes out from under you due to the task being migrated, that's
> likely just a bug in the scheduler).

Right, I guess the biggest issue here is dealing with locking in a safe
way. As for the scheduler not being 100% accurate with prev_cpu, etc., I
think that's a non-problem: in the end we don't need to be formally
perfect, and we can tolerate small inaccuracies if this API is used from a
non-atomic context.

I'll think more about it and do some experiments with locking.

Thanks!
-Andrea

> 
> Thanks,
> David
> 
> >  	/*
> >  	 * This may also be called from ops.enqueue(), so we need to handle
> >  	 * per-CPU tasks as well. For these tasks, we can skip all idle CPU
> > @@ -956,10 +961,16 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64
> >  	if (p->nr_cpus_allowed == 1) {
> >  		if (cpumask_test_cpu(prev_cpu, cpus_allowed) &&
> >  		    scx_idle_test_and_clear_cpu(prev_cpu))
> > -			return prev_cpu;
> > -		return -EBUSY;
> > +			cpu = prev_cpu;
> > +		else
> > +			cpu = -EBUSY;
> > +		goto out_unlock;
> >  	}
> >  	cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
> > +
> > +out_unlock:
> > +	if (scx_kf_allowed_if_unlocked())
> > +		cpus_read_unlock();
> >  #else
> >  	cpu = -EBUSY;
> >  #endif
> > @@ -1266,6 +1277,7 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU)
> >  BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> >  BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU)
> >  BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> > +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> >  BTF_KFUNCS_END(scx_kfunc_ids_idle)
> >  
> >  static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> > @@ -1275,7 +1287,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> >  
> >  BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> >  BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> > -BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> >  BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> >  
> >  static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> > -- 
> > 2.49.0
> >
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Andrea Righi 9 months ago
On Mon, May 12, 2025 at 08:29:26PM +0200, Andrea Righi wrote:
...
> > > +	/*
> > > +	 * If called from an unlocked context, try to acquire
> > > +	 * cpus_read_lock() to avoid races with CPU hotplug.
> > > +	 */
> > > +	if (scx_kf_allowed_if_unlocked())
> > > +		if (!cpus_read_trylock())
> > > +			return -EBUSY;
> > 
> > Hmm, so this now assumes that this function can be called without the rq lock
> > held. I'm not sure if that's safe, as scx_select_cpu_dfl() queries p->cpus_ptr,
> > which is protected by the rq lock. Additionally, scx_bpf_select_cpu_and()
> > checks p->nr_cpus_allowed which is protected sometimes by pi_lock, sometimes by
> > the rq lock, etc depending on its state.
> 
> Yeah, it makes locking quite tricky. Maybe we can acquire the rq lock if
> called from an unlocked context, instead of cpus_read_lock(), but then we
> still have to deal with pi_lock.

Actually that might work, since pi_lock should never be held when
scx_kf_allowed_if_unlocked() is true, so basically we can do:

	if (scx_kf_allowed_if_unlocked())
		rq = task_rq_lock(p, &rf);
	...
	if (scx_kf_allowed_if_unlocked())
		task_rq_unlock(rq, p, &rf);

Or at least it should cover all current use cases. The tricky one is
scx_bpf_select_cpu_and() being called via BPF test_run from a user-space
task (scx_rustland_core).

If we had a way to clearly identify a test_run context, we could restrict
this to BPF_PROG_TYPE_STRUCT_OPS and BPF_PROG_TYPE_TEST_RUN (but as far as
I can tell, the latter doesn't exist).
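
Putting it together, scx_bpf_select_cpu_and() could look something like the
sketch below (untested, just to illustrate the idea; CONFIG_SMP=n handling
omitted):

	__bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
					       const struct cpumask *cpus_allowed, u64 flags)
	{
		struct rq_flags rf;
		struct rq *rq = NULL;
		bool unlocked = scx_kf_allowed_if_unlocked();
		s32 cpu;

		if (!check_builtin_idle_enabled())
			return -EBUSY;

		/*
		 * When called from an unlocked context, pin the task to its rq:
		 * pi_lock is never held here, so task_rq_lock() is safe, and it
		 * keeps p->cpus_ptr / p->nr_cpus_allowed stable for the duration
		 * of the idle CPU selection.
		 */
		if (unlocked)
			rq = task_rq_lock(p, &rf);

		if (p->nr_cpus_allowed == 1) {
			if (cpumask_test_cpu(prev_cpu, cpus_allowed) &&
			    scx_idle_test_and_clear_cpu(prev_cpu))
				cpu = prev_cpu;
			else
				cpu = -EBUSY;
		} else {
			cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, cpus_allowed, flags);
		}

		if (unlocked)
			task_rq_unlock(rq, p, &rf);

		return cpu;
	}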

Thanks,
-Andrea
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Tejun Heo 9 months ago
Hello,

On Mon, May 12, 2025 at 09:07:49PM +0200, Andrea Righi wrote:
...
> 	if (scx_kf_allowed_if_unlocked())
> 		rq = task_rq_lock(p, &rf);
> 	...
> 	if (scx_kf_allowed_if_unlocked())
> 		task_rq_unlock(rq, p, &rf);
> 
> Or at least it should cover all current use cases. The tricky one is
> scx_bpf_select_cpu_and() being called via BPF test_run from a user-space
> task (scx_rustland_core).
> 
> If we had a way to clearly identify a test_run context, we could restrict
> this to BPF_PROG_TYPE_STRUCT_OPS and BPF_PROG_TYPE_TEST_RUN (but as far as
> I can tell, the latter doesn't exist).

Shouldn't that work as-is? TEST_RUN isn't gonna set any kf_mask, so the same
condition would work?

Thanks.

-- 
tejun
Re: [PATCH 2/2] sched_ext/idle: Make scx_bpf_select_cpu_and() usable from any context
Posted by Andrea Righi 9 months ago
On Mon, May 12, 2025 at 10:00:19AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, May 12, 2025 at 09:07:49PM +0200, Andrea Righi wrote:
> ...
> > 	if (scx_kf_allowed_if_unlocked())
> > 		rq = task_rq_lock(p, &rf);
> > 	...
> > 	if (scx_kf_allowed_if_unlocked())
> > 		task_rq_unlock(rq, p, &rf);
> > 
> > Or at least it should cover all current use cases. The tricky one is
> > scx_bpf_select_cpu_and() being called via BPF test_run from a user-space
> > task (scx_rustland_core).
> > 
> > If we had a way to clearly identify a test_run context, we could restrict
> > this to BPF_PROG_TYPE_STRUCT_OPS and BPF_PROG_TYPE_TEST_RUN (but as far as
> > I can tell, the latter doesn't exist).
> 
> Shouldn't that work as-is? TEST_RUN isn't gonna set any kf_mask, so the same
> condition would work?

Oh yes, it works. My point is that, without any restrictions, the kfunc
might be used in problematic contexts: for example, if it's called with
pi_lock held it could trigger a deadlock with task_rq_lock(). Limiting its
use to ops.select_cpu(), ops.enqueue(), ops.dispatch(), and test_run would
cover all the use cases while staying safe.
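
Concretely, the context check could look something like this (just a
sketch, not tested):

	/*
	 * Allow the kfunc from ops.select_cpu(), ops.enqueue() and
	 * ops.dispatch(), or from a fully unlocked context (e.g. a
	 * syscall/test_run program); reject anything else, so the unlocked
	 * path never runs with pi_lock already held and task_rq_lock()
	 * cannot deadlock.
	 */
	if (!scx_kf_allowed_if_unlocked() &&
	    !scx_kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
		return -EPERM;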

Thanks,
-Andrea