sched_ext: Fix SCX_KICK_WAIT to work reliably

Posted by Tejun Heo 3 months, 2 weeks ago
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked(), which was called
whenever a CPU picked the next task to run and incremented pnt_seq,
allowing SCX_KICK_WAIT to reliably wait for the target CPU to enter the
scheduler and pick the next task.

However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.

This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx()
into pick_task_scx()"), which refactored the pick path and made
put_prev_task_scx() the natural place to track task switches for
SCX_KICK_WAIT. The fix moves the pnt_seq increment to
put_prev_task_scx() and refines the semantics: if the current task on
the target CPU is on SCX, SCX_KICK_WAIT waits until that task switches
out. This provides a sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
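
For example (illustrative sketch only; sibling_of() is a hypothetical
helper standing in for whatever SMT topology lookup the scheduler
uses), a core-scheduling style BPF scheduler could kick the SMT sibling
and only proceed once its current SCX task is off the CPU:

	/*
	 * Force whatever SCX task is running on the sibling off the CPU
	 * and don't return until it has switched out.
	 */
	s32 sib = sibling_of(cpu);	/* hypothetical helper */

	if (sib >= 0)
		scx_bpf_kick_cpu(sib, SCX_KICK_PREEMPT | SCX_KICK_WAIT);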

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          |   31 ++++++++++++++++++-------------
 kernel/sched/ext_internal.h |    6 ++++--
 2 files changed, 22 insertions(+), 15 deletions(-)

--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
 	struct scx_sched *sch = scx_root;
 	const struct sched_class *next_class = next->sched_class;
 
-	/*
-	 * Pairs with the smp_load_acquire() issued by a CPU in
-	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
-	 * resched.
-	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 	if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
 		return;
 
@@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
 			      struct task_struct *next)
 {
 	struct scx_sched *sch = scx_root;
+
+	/*
+	 * Pairs with the smp_load_acquire() issued by a CPU in
+	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+	 * resched.
+	 */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
 	update_curr_scx(rq);
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
@@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
 		}
 
 		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
-			pseqs[cpu] = rq->scx.pnt_seq;
-			should_wait = true;
+			if (cur_class == &ext_sched_class) {
+				pseqs[cpu] = rq->scx.pnt_seq;
+				should_wait = true;
+			} else {
+				cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+			}
 		}
 
 		resched_curr(rq);
@@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
 
 		if (cpu != cpu_of(this_rq)) {
 			/*
-			 * Pairs with smp_store_release() issued by this CPU in
-			 * switch_class() on the resched path.
+			 * Pairs with store_release in put_prev_task_scx().
 			 *
-			 * We busy-wait here to guarantee that no other task can
-			 * be scheduled on our core before the target CPU has
-			 * entered the resched path.
+			 * We busy-wait here to guarantee that the task running
+			 * at the time of kicking is no longer running. This can
+			 * be used to implement e.g. core scheduling.
 			 */
 			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
 				cpu_relax();
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -997,8 +997,10 @@ enum scx_kick_flags {
 	SCX_KICK_PREEMPT	= 1LLU << 1,
 
 	/*
-	 * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
-	 * return after the target CPU finishes picking the next task.
+	 * The scx_bpf_kick_cpu() call will return after the current SCX task of
+	 * the target CPU switches out. This can be used to implement e.g. core
+	 * scheduling. This has no effect if the current task on the target CPU
+	 * is not on SCX.
 	 */
 	SCX_KICK_WAIT		= 1LLU << 2,
 };
Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Posted by Peter Zijlstra 3 months, 2 weeks ago
On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:

> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>  
>  		if (cpu != cpu_of(this_rq)) {
>  			/*
> -			 * Pairs with smp_store_release() issued by this CPU in
> -			 * switch_class() on the resched path.
> +			 * Pairs with store_release in put_prev_task_scx().
>  			 *
> -			 * We busy-wait here to guarantee that no other task can
> -			 * be scheduled on our core before the target CPU has
> -			 * entered the resched path.
> +			 * We busy-wait here to guarantee that the task running
> +			 * at the time of kicking is no longer running. This can
> +			 * be used to implement e.g. core scheduling.
>  			 */
>  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
>  				cpu_relax();

You could consider using:

			smp_cond_load_acquire(wait_pnt_seq, VAL != pseqs[cpu]);

that's the fancy way of doing a spin wait and allows architectures to
optimize (mostly arm64 at this point).
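
FWIW, ignoring the arch-specific variants, the generic fallback in
include/asm-generic/barrier.h behaves roughly like the open-coded loop
it replaces (paraphrased sketch, not the actual macro expansion):

	/* spin until the condition holds, then acquire-order the load */
	for (;;) {
		unsigned long val = READ_ONCE(*wait_pnt_seq);

		if (val != pseqs[cpu])
			break;
		cpu_relax();
	}
	smp_acquire__after_ctrl_dep();

while arm64 can map the wait onto LDXR/WFE and sleep until the
cacheline changes instead of spinning in cpu_relax().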
Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Posted by Tejun Heo 3 months, 2 weeks ago
On Wed, Oct 22, 2025 at 10:03:46AM +0200, Peter Zijlstra wrote:
> >  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> >  				cpu_relax();
> 
> You could consider using:
> 
> 			smp_cond_load_acquire(wait_pnt_seq, VAL != pseqs[cpu]);
> 
> that's the fancy way of doing a spin wait and allows architectures to
> optimize (mostly arm64 at this point).

Will do. Thanks.

-- 
tejun
Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Posted by Andrea Righi 3 months, 2 weeks ago
Hi Tejun,

On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:
> SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
> a reschedule and can be used to implement operations like core scheduling.
> 
> This used to be implemented by scx_next_task_picked(), which was called
> whenever a CPU picked the next task to run and incremented pnt_seq,
> allowing SCX_KICK_WAIT to reliably wait for the target CPU to enter the
> scheduler and pick the next task.
> 
> However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
> with switch_class()") replaced scx_next_task_picked() with the
> switch_class() callback, which is only called when switching between sched
> classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
> reliably incremented unless the previous task was SCX and the next task was
> not.
> 
> This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx()
> into pick_task_scx()"), which refactored the pick path and made
> put_prev_task_scx() the natural place to track task switches for
> SCX_KICK_WAIT. The fix moves the pnt_seq increment to
> put_prev_task_scx() and refines the semantics: if the current task on
> the target CPU is on SCX, SCX_KICK_WAIT waits until that task switches
> out. This provides a sufficient guarantee for use cases like core
> scheduling while keeping the operation self-contained within SCX.
> 
> Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c          |   31 ++++++++++++++++++-------------
>  kernel/sched/ext_internal.h |    6 ++++--
>  2 files changed, 22 insertions(+), 15 deletions(-)
> 
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
>  	struct scx_sched *sch = scx_root;
>  	const struct sched_class *next_class = next->sched_class;
>  
> -	/*
> -	 * Pairs with the smp_load_acquire() issued by a CPU in
> -	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> -	 * resched.
> -	 */
> -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
>  	if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
>  		return;
>  
> @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
>  			      struct task_struct *next)
>  {
>  	struct scx_sched *sch = scx_root;
> +
> +	/*
> +	 * Pairs with the smp_load_acquire() issued by a CPU in
> +	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> +	 * resched.
> +	 */
> +	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +
>  	update_curr_scx(rq);
>  
>  	/* see dequeue_task_scx() on why we skip when !QUEUED */
> @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
>  		}
>  
>  		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
> -			pseqs[cpu] = rq->scx.pnt_seq;
> -			should_wait = true;
> +			if (cur_class == &ext_sched_class) {
> +				pseqs[cpu] = rq->scx.pnt_seq;
> +				should_wait = true;
> +			} else {
> +				cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
> +			}
>  		}
>  
>  		resched_curr(rq);
> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>  
>  		if (cpu != cpu_of(this_rq)) {

It's probably fine anyway, but should we check for cpu_online(cpu) here?

>  			/*
> -			 * Pairs with smp_store_release() issued by this CPU in
> -			 * switch_class() on the resched path.
> +			 * Pairs with store_release in put_prev_task_scx().
>  			 *
> -			 * We busy-wait here to guarantee that no other task can
> -			 * be scheduled on our core before the target CPU has
> -			 * entered the resched path.
> +			 * We busy-wait here to guarantee that the task running
> +			 * at the time of kicking is no longer running. This can
> +			 * be used to implement e.g. core scheduling.
>  			 */
>  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
>  				cpu_relax();

I'm wondering if we can break the semantics if cpu_rq(cpu)->curr->scx.slice
is refilled concurrently between kick_one_cpu() and this busy wait. In this
case we return, because wait_pnt_seq is incremented, but we keep running
the same task.

Should we introduce a flag (or something similar) to force the re-enqueue
of the prev task in this case?

> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -997,8 +997,10 @@ enum scx_kick_flags {
>  	SCX_KICK_PREEMPT	= 1LLU << 1,
>  
>  	/*
> -	 * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
> -	 * return after the target CPU finishes picking the next task.
> +	 * The scx_bpf_kick_cpu() call will return after the current SCX task of
> +	 * the target CPU switches out. This can be used to implement e.g. core
> +	 * scheduling. This has no effect if the current task on the target CPU
> +	 * is not on SCX.
>  	 */
>  	SCX_KICK_WAIT		= 1LLU << 2,
>  };

Thanks,
-Andrea
Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Posted by Tejun Heo 3 months, 2 weeks ago
Hello, Andrea.

On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote:
> > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
> >  
> >  		if (cpu != cpu_of(this_rq)) {
> 
> It's probably fine anyway, but should we check for cpu_online(cpu) here?

This block gets activated iff kick_one_cpu() returns true and that is gated
by the CPU being online && the current task being on SCX. For the CPU to go
offline, that task has to go off CPU and thus increment the sequence
counter.

> >  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> >  				cpu_relax();
> 
> I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
> is refilled concurrently between kick_one_cpu() and this busy wait. In this
> case we return, because wait_pnt_seq is incremented, but we keep running
> the same task.
> 
> Should we introduce a flag (or something similar) to force the re-enqueue
> of the prev task in this case?

Ah, right, that's a hole. There's another hole. The BPF scheduler can choose
to run the same task and put_prev_task_scx() won't be called. I think we
need to bump the seq count on entry to pick_task_scx() too. That should
solve both problems. All that we're guaranteeing is that we wait until the
task enters the scheduling path. If a higher-class task gets picked,
put_prev_task_scx() will be called. Otherwise, we break the wait when
pick_task_scx() is entered.
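
Something like this at the top of pick_task_scx() (untested sketch, not
the final patch):

	/*
	 * Also pairs with the smp_load_acquire() in kick_cpus_irq_workfn().
	 * Bumping on pick entry breaks the wait even when the BPF scheduler
	 * picks the same task again or its slice was refilled concurrently.
	 */
	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);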

Thanks.

-- 
tejun
Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Posted by Andrea Righi 3 months, 2 weeks ago
On Wed, Oct 22, 2025 at 08:37:50AM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote:
> > > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
> > >  
> > >  		if (cpu != cpu_of(this_rq)) {
> > 
> > It's probably fine anyway, but should we check for cpu_online(cpu) here?
> 
> This block gets activated iff kick_one_cpu() returns true and that is gated
> by the CPU being online && the current task being on SCX. For the CPU to go
> offline, that task has to go off CPU and thus increment the sequence
> counter.

I was thinking about the case where the CPU goes offline after
kick_one_cpu() returns and before reaching this loop, but even then
we're not accessing anything unsafe, so we should be fine.

> 
> > >  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> > >  				cpu_relax();
> > 
> > I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
> > is refilled concurrently between kick_one_cpu() and this busy wait. In this
> > case we return, because wait_pnt_seq is incremented, but we keep running
> > the same task.
> > 
> > Should we introduce a flag (or something similar) to force the re-enqueue
> > of the prev task in this case?
> 
> Ah, right, that's a hole. There's another hole. The BPF scheduler can choose
> to run the same task and put_prev_task_scx() won't be called. I think we
> need to bump the seq count on entry to pick_task_scx() too. That should
> solve both problems. All that we're guaranteeing is that we wait until the
> task enters scheduling path. If a higher class task gets picked,
> put_prev_task_scx() will be called. Otherwise, we break the wait when
> pick_task_scx() is entered.

Yeah, that sounds reasonable to me.

Thanks,
-Andrea