[RFC][PATCH] sched/ext: Split curr|donor references properly

John Stultz posted 1 patch 1 week, 2 days ago
With proxy-exec, we want to do the accounting against the donor
most of the time. Without proxy-exec, there should be no
difference as the rq->donor and rq->curr are the same.

So rework the logic to reference the rq->donor where appropriate.

Also add donor info to scx_dump_state()

Since CONFIG_SCHED_PROXY_EXEC currently depends on
!CONFIG_SCHED_CLASS_EXT, this should have no effect
(other than the extra donor output in scx_dump_state),
but this is one step needed to eventually remove that
constraint for proxy-exec.

Just wanted to send this out for early review prior to LPC.

Feedback or thoughts would be greatly appreciated!

Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Changwoo Min <changwoo@igalia.com>
Cc: sched-ext@lists.linux.dev
Cc: kernel-team@android.com
---
 kernel/sched/ext.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49e9649a..446091cba4429 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -938,17 +938,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
 
 static void update_curr_scx(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *donor = rq->donor;
 	s64 delta_exec;
 
 	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0))
 		return;
 
-	if (curr->scx.slice != SCX_SLICE_INF) {
-		curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
-		if (!curr->scx.slice)
-			touch_core_sched(rq, curr);
+	if (donor->scx.slice != SCX_SLICE_INF) {
+		donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
+		if (!donor->scx.slice)
+			touch_core_sched(rq, donor);
 	}
 }
 
@@ -1090,14 +1090,14 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 		struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
 		bool preempt = false;
 
-		if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
-		    rq->curr->sched_class == &ext_sched_class) {
-			rq->curr->scx.slice = 0;
+		if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
+		    rq->donor->sched_class == &ext_sched_class) {
+			rq->donor->scx.slice = 0;
 			preempt = true;
 		}
 
 		if (preempt || sched_class_above(&ext_sched_class,
-						 rq->curr->sched_class))
+						 rq->donor->sched_class))
 			resched_curr(rq);
 	} else {
 		raw_spin_unlock(&dsq->lock);
@@ -2001,7 +2001,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		}
 
 		/* if the destination CPU is idle, wake it up */
-		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+		if (sched_class_above(p->sched_class, dst_rq->donor->sched_class))
 			resched_curr(dst_rq);
 	}
 
@@ -2424,7 +2424,7 @@ static struct task_struct *first_local_task(struct rq *rq)
 static struct task_struct *
 do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
 {
-	struct task_struct *prev = rq->curr;
+	struct task_struct *prev = rq->donor;
 	bool keep_prev, kick_idle = false;
 	struct task_struct *p;
 
@@ -3093,7 +3093,7 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
 #ifdef CONFIG_NO_HZ_FULL
 bool scx_can_stop_tick(struct rq *rq)
 {
-	struct task_struct *p = rq->curr;
+	struct task_struct *p = rq->donor;
 
 	if (scx_rq_bypassing(rq))
 		return false;
@@ -4587,6 +4587,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		dump_line(&ns, "          curr=%s[%d] class=%ps",
 			  rq->curr->comm, rq->curr->pid,
 			  rq->curr->sched_class);
+		dump_line(&ns, "          donor=%s[%d] class=%ps",
+			  rq->donor->comm, rq->donor->pid,
+			  rq->donor->sched_class);
 		if (!cpumask_empty(rq->scx.cpus_to_kick))
 			dump_line(&ns, "  cpus_to_kick   : %*pb",
 				  cpumask_pr_args(rq->scx.cpus_to_kick));
@@ -5426,7 +5429,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	unsigned long flags;
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	cur_class = rq->curr->sched_class;
+	cur_class = rq->donor->sched_class;
 
 	/*
 	 * During CPU hotplug, a CPU may depend on kicking itself to make
@@ -5438,7 +5441,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	    !sched_class_above(cur_class, &ext_sched_class)) {
 		if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
 			if (cur_class == &ext_sched_class)
-				rq->curr->scx.slice = 0;
+				rq->donor->scx.slice = 0;
 			cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
 		}
 
-- 
2.52.0.223.gf5cc29aaa4-goog
Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Joel Fernandes 1 week, 2 days ago
On Sat, Dec 06, 2025 at 12:14:45AM +0000, John Stultz wrote:
> With proxy-exec, we want to do the accounting against the donor
> most of the time. Without proxy-exec, there should be no
> difference as the rq->donor and rq->curr are the same.
> 
> So rework the logic to reference the rq->donor where appropriate.
> 
> Also add donor info to scx_dump_state()
> 
> Since CONFIG_SCHED_PROXY_EXEC currently depends on
> !CONFIG_SCHED_CLASS_EXT, this should have no effect
> (other than the extra donor output in scx_dump_state),
> but this is one step needed to eventually remove that
> constraint for proxy-exec.
> 
> Just wanted to send this out for early review prior to LPC.
> 
> Feedback or thoughts would be greatly appreciated!

Hi John,

I'm wondering if this will work well for BPF tasks because my understanding
is that some scheduler BPF programs also monitor runtime statistics. If they are unaware of proxy execution, how will it work?

I don't see any code in the patch that passes the donor information to the
BPF ops, for instance. I would really like the SCX folks to chime in before
we can move this patch forward. Thanks for marking it as an RFC.

We need to get a handle on how a scheduler BPF program will pass information
about the donor to the currently executing task. If we can make this happen
transparently, that's ideal. Otherwise, we may have to pass both the donor
task and the currently executing task to the BPF ops.

Thanks,

 - Joel


Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Andrea Righi 1 week, 1 day ago
On Fri, Dec 05, 2025 at 09:47:24PM -0500, Joel Fernandes wrote:
> On Sat, Dec 06, 2025 at 12:14:45AM +0000, John Stultz wrote:
> > With proxy-exec, we want to do the accounting against the donor
> > most of the time. Without proxy-exec, there should be no
> > difference as the rq->donor and rq->curr are the same.
> > 
> > So rework the logic to reference the rq->donor where appropriate.
> > 
> > Also add donor info to scx_dump_state()
> > 
> > Since CONFIG_SCHED_PROXY_EXEC currently depends on
> > !CONFIG_SCHED_CLASS_EXT, this should have no effect
> > (other than the extra donor output in scx_dump_state),
> > but this is one step needed to eventually remove that
> > constraint for proxy-exec.
> > 
> > Just wanted to send this out for early review prior to LPC.
> > 
> > Feedback or thoughts would be greatly appreciated!
> 
> Hi John,
> 
> I'm wondering if this will work well for BPF tasks because my understanding
> is that some scheduler BPF programs also monitor runtime statistics. If they are unaware of proxy execution, how will it work?

Right, some schedulers are relying on p->scx.slice to evaluate task
runtime. It'd be nice for the BPF schedulers to be aware of the donor.

> 
> I don't see any code in the patch that passes the donor information to the
> BPF ops, for instance. I would really like the SCX folks to chime in before
> we can move this patch forward. Thanks for marking it as an RFC.
> 
> We need to get a handle on how a scheduler BPF program will pass information
> about the donor to the currently executing task. If we can make this happen
> transparently, that's ideal. Otherwise, we may have to pass both the donor
> task and the currently executing task to the BPF ops.

That's what I was thinking, callbacks like ops.running(), ops.tick() and
ops.stopping() should probably have a struct task_struct *donor argument in
addition to struct task_struct *p. Then the BPF scheduler can decide how to
use the donor information (this would address also the runtime evaluation).
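
Something along these lines, just to sketch the shape (the signatures here
are made up for illustration, not an actual proposal):

struct sched_ext_ops {
	...
	void (*running)(struct task_struct *p, struct task_struct *donor);
	void (*tick)(struct task_struct *p, struct task_struct *donor);
	void (*stopping)(struct task_struct *p, bool runnable,
			 struct task_struct *donor);
	...
};

Without proxy-exec, @donor would simply equal @p, so existing schedulers
could keep ignoring it.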

Thanks,
-Andrea

Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Andrea Righi 1 week, 1 day ago
On Sun, Dec 07, 2025 at 09:54:32AM +0100, Andrea Righi wrote:
> On Fri, Dec 05, 2025 at 09:47:24PM -0500, Joel Fernandes wrote:
> > On Sat, Dec 06, 2025 at 12:14:45AM +0000, John Stultz wrote:
> > > With proxy-exec, we want to do the accounting against the donor
> > > most of the time. Without proxy-exec, there should be no
> > > difference as the rq->donor and rq->curr are the same.
> > > 
> > > So rework the logic to reference the rq->donor where appropriate.
> > > 
> > > Also add donor info to scx_dump_state()
> > > 
> > > Since CONFIG_SCHED_PROXY_EXEC currently depends on
> > > !CONFIG_SCHED_CLASS_EXT, this should have no effect
> > > (other than the extra donor output in scx_dump_state),
> > > but this is one step needed to eventually remove that
> > > constraint for proxy-exec.
> > > 
> > > Just wanted to send this out for early review prior to LPC.
> > > 
> > > Feedback or thoughts would be greatly appreciated!
> > 
> > Hi John,
> > 
> > I'm wondering if this will work well for BPF tasks because my understanding
> > is that some scheduler BPF programs also monitor runtime statistics. If they are unaware of proxy execution, how will it work?
> 
> Right, some schedulers are relying on p->scx.slice to evaluate task
> runtime. It'd be nice for the BPF schedulers to be aware of the donor.
> 
> > 
> > I don't see any code in the patch that passes the donor information to the
> > BPF ops, for instance. I would really like the SCX folks to chime in before
> > we can move this patch forward. Thanks for marking it as an RFC.
> > 
> > We need to get a handle on how a scheduler BPF program will pass information
> > about the donor to the currently executing task. If we can make this happen
> > transparently, that's ideal. Otherwise, we may have to pass both the donor
> > task and the currently executing task to the BPF ops.
> 
> That's what I was thinking, callbacks like ops.running(), ops.tick() and
> ops.stopping() should probably have a struct task_struct *donor argument in
> addition to struct task_struct *p. Then the BPF scheduler can decide how to
> use the donor information (this would address also the runtime evaluation).

Or, better, have a kfunc like the following (I'm just sketching it, this is
likely broken):

__bpf_kfunc struct task_struct *scx_bpf_task_donor(const struct task_struct *p)
{
	struct task_struct *curr, *donor;
	struct rq *rq;

	if (!IS_ENABLED(CONFIG_SCHED_PROXY_EXEC))
		return (struct task_struct *)p;

	rq = task_rq(p);
	curr = READ_ONCE(rq->curr);
	donor = READ_ONCE(rq->donor);

	/*
	 * If @p is currently executing, return the donor.
	 *
	 * The donor can be:
	 * - same as curr (no proxy execution active)
	 * - different from curr (proxy execution: curr is running with
	 *   donor's context)
	 */
	if (curr == p)
		return donor;

	/*
	 * If @p is not currently executing (queued, sleeping, etc.),
	 * the concept of donor doesn't apply, return @p itself.
	 */
	return (struct task_struct *)p;
}

And then let the BPF scheduler decide how to use this information (while
still updating time slice and check sched_class accordingly, as John is
proposing).
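
E.g. a BPF scheduler could then do something like this in its stopping
callback (again just a sketch, the scheduler name and the vtime math are
hypothetical):

void BPF_STRUCT_OPS(myscheduler_stopping, struct task_struct *p, bool runnable)
{
	struct task_struct *donor = scx_bpf_task_donor(p);

	/* charge the consumed slice against the donor, not @p */
	donor->scx.dsq_vtime +=
		(SCX_SLICE_DFL - p->scx.slice) * 100 / donor->scx.weight;
}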

-Andrea
Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by John Stultz 1 week, 2 days ago
On Fri, Dec 5, 2025 at 6:47 PM Joel Fernandes <joelagnelf@nvidia.com> wrote:
> On Sat, Dec 06, 2025 at 12:14:45AM +0000, John Stultz wrote:
> > With proxy-exec, we want to do the accounting against the donor
> > most of the time. Without proxy-exec, there should be no
> > difference as the rq->donor and rq->curr are the same.
> >
> > So rework the logic to reference the rq->donor where appropriate.
> >
> > Also add donor info to scx_dump_state()
> >
> > Since CONFIG_SCHED_PROXY_EXEC currently depends on
> > !CONFIG_SCHED_CLASS_EXT, this should have no effect
> > (other than the extra donor output in scx_dump_state),
> > but this is one step needed to eventually remove that
> > constraint for proxy-exec.
> >
> > Just wanted to send this out for early review prior to LPC.
> >
> > Feedback or thoughts would be greatly appreciated!
>
> Hi John,
>
> I'm wondering if this will work well for BPF tasks because my understanding
> is that some scheduler BPF programs also monitor runtime statistics. If they are unaware of proxy execution, how will it work?

Good question! Be sure to come to my LPC talk on this next week! :)
https://lpc.events/event/19/contributions/2032/

> I don't see any code in the patch that passes the donor information to the
> BPF ops, for instance. I would really like the SCX folks to chime in before
> we can move this patch forward. Thanks for marking it as an RFC.

Oh yes, this RFC is intended to just be something to open initial
discussion for the session next week. I'm very much hoping to get
further thoughts on it, in person, next week.

> We need to get a handle on how a scheduler BPF program will pass information
> about the donor to the currently executing task. If we can make this happen
> transparently, that's ideal. Otherwise, we may have to pass both the donor
> task and the currently executing task to the BPF ops.

So, one thing about proxy-exec is that the class schedulers pretty much
keep their existing behavior. It's just that the core scheduler may
not actually run what they pick.

That's ok, as the task they pick becomes the rq->donor that we want to
use for pretty much all the scheduling accounting (the exceptions being
the cputime accounting necessary for cputimers on the running task to
behave sanely, as well as top output - as you helped identify
earlier). So this patch just shifts the class scheduler to
utilize the donor pointer instead of curr, so we are consistent in the
proxy case.

As for the concern about communicating the split context (rq->donor vs
rq->curr) to the bpf program, to my understanding, the DSQ abstraction
seems to make that unnecessary. It provides a general enough interface
for the bpf logic that it seems we only have to worry about the split
context on the sched/ext.c side as it processes the DSQ.
That said, I'm no sched_ext expert, so I'm hoping at LPC we can find
any edge cases that do need to be dealt with.

Now, some of the sched/ext.c logic does seem to want to know if idle
is running, so that's the only case where I've left rq->curr. But I
believe the switch to donor that this patch does should be ok,
since without proxy-exec rq->donor == rq->curr.

Elsewhere there are outstanding issues, such as proxy-exec needing to
briefly schedule idle (see proxy_resched_idle()) in order to get the
current task off the cpu, and this seems to cause confusion around
SCX_ENQ_LAST logic (since proxy-exec may briefly switch to idle even
if there's a runnable task).  So these items do need to get resolved
before we remove the Kconfig exclusivity between proxy-exec and
sched_ext. But again, I'm looking forward to next week to try to hear
what folks think the best approach will be.

Thanks again for the thought here! Always appreciate your feedback!
-john
Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Joel Fernandes 1 week, 2 days ago
Hi John,

On 12/5/2025 11:49 PM, John Stultz wrote:
> On Fri, Dec 5, 2025 at 6:47 PM Joel Fernandes <joelagnelf@nvidia.com> wrote:
>> On Sat, Dec 06, 2025 at 12:14:45AM +0000, John Stultz wrote:
>>> With proxy-exec, we want to do the accounting against the donor
>>> most of the time. Without proxy-exec, there should be no
>>> difference as the rq->donor and rq->curr are the same.
>>>
>>> So rework the logic to reference the rq->donor where appropriate.
>>>
>>> Also add donor info to scx_dump_state()
>>>
>>> Since CONFIG_SCHED_PROXY_EXEC currently depends on
>>> !CONFIG_SCHED_CLASS_EXT, this should have no effect
>>> (other than the extra donor output in scx_dump_state),
>>> but this is one step needed to eventually remove that
>>> constraint for proxy-exec.
>>>
>>> Just wanted to send this out for early review prior to LPC.
>>>
>>> Feedback or thoughts would be greatly appreciated!
>>
>> Hi John,
>>
>> I'm wondering if this will work well for BPF tasks because my understanding
>> is that some scheduler BPF programs also monitor runtime statistics. If they are unaware of proxy execution, how will it work?
> 
> Good question! Be sure to come to my LPC talk on this next week! :)
> https://lpc.events/event/19/contributions/2032/

Sure, will try to make it to the talk and hopefully no conflicts. :)

> 
>> I don't see any code in the patch that passes the donor information to the
>> BPF ops, for instance. I would really like the SCX folks to chime in before
>> we can move this patch forward. Thanks for marking it as an RFC.
> 
> Oh yes, this RFC is intended to just be something to open initial
> discussion for the session next week. I'm very much hoping to get
> further thoughts on it, in person, next week.

Cool!
>> We need to get a handle on how a scheduler BPF program will pass information
>> about the donor to the currently executing task. If we can make this happen
>> transparently, that's ideal. Otherwise, we may have to pass both the donor
>> task and the currently executing task to the BPF ops.
> 
> So, one thing about proxy-exec is that the class schedulers pretty much
> keep their existing behavior. It's just that the core scheduler may
> not actually run what they pick.

Didn't we have complexities with RT, push-pull lists and such? That was class
specific, no?

> That's ok, as the task they pick becomes the rq->donor that we want to
> use for pretty much all the scheduling accounting (the exception being
> the cputime accounting necessary for cputimers on the running task to
> behave sanely as well as top output - as you have helped identify
> earlier).  So this patch is just shifting the class scheduler to
> utilize the donor pointer instead of curr, so we are consistent in the
> proxy case.
> 
> As for the concern about communicating the split context (rq->donor vs
> rq->curr) to the bpf program, to my understanding, the DSQ abstraction
> seems to make that unnecessary. It provides a general enough interface
> for the bpf logic, that it seems we only have to worry about the split
> context on the the sched/ext.c logic side as it processes the DSQ.
> That said, I'm no sched_ext expert, so I'm hoping at LPC we can find
> any edge cases that do need to be dealt with.

Right, so it is exactly these pointer shifts I was concerned about. Runtime
callbacks such as 'stopping' [1] directly use p->scx.slice.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c#L124
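
For reference, the stopping callback there does roughly the following
(paraphrased, not verbatim):

void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
{
	/*
	 * Scale the consumed part of the slice by the inverse of the
	 * task's weight and charge it to the task's vtime.
	 */
	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}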

So we have to pass the correct 'p' to these callbacks. Did I miss something
about your patch though that handles this?

If this is indeed a problem, maybe one way to get around it initially is to make
'proxy exec' an opt-in for BPF schedulers. But then we'd have to handle a hybrid
world.

At a high level, my understanding is BPF schedulers have a lot of say in how to
schedule including precise time slice and preemption control (give or take level
of control and performance reasons). You can in fact have your own 'userland'
queues that the kernel is unaware of, IIUC. I am not sure if proxy exec will
transparently work for all those usecases. It will probably work properly only
when BPF scheduling in userland is simple and most of the scheduling is done by
non-BPF kernel code.

Maybe this isn't a problem at all, but I thought I'd double check. :)

> Thanks again for the thought here! Always appreciate your feedback!
Sure, any time John! I am glad to see your patches continuously flowing.

cheers,

 - Joel

Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Tejun Heo 1 week, 2 days ago
Hello,

On Sat, Dec 06, 2025 at 09:56:50AM -0500, Joel Fernandes wrote:
...
> At a high level, my understanding is BPF schedulers have a lot of say in how to
> schedule including precise time slice and preemption control (give or take level
> of control and performance reasons). You can in fact have your own 'userland'
> queues that the kernel is unaware, IIUC. I am not sure if proxy exec will
> transparently work for all those usecases. It will probably work properly only
> when BPF scheduling in userland is simple and most of the scheduling is done by
> non-BPF kernel code.

Maybe this can be resolved by proxy execution explicitly telling sched_ext
to essentially dequeue the proxy-executed task so that it's kept more
transparent. However, I wonder whether it'd be a useful first step to first
deconflict the two config options, i.e. allow sched_ext loading to disable proxy
execution dynamically so that people don't have to choose between the two
config options. I don't know the code intimately, but, from just skimming
it, it looks like it can be drained with a static key and some percpu
counters.
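
Something like the following maybe (sketch only, all names hypothetical):

DEFINE_STATIC_KEY_TRUE(proxy_exec_enabled);

/* called when an scx scheduler is loaded */
static void scx_disable_proxy_exec(void)
{
	static_branch_disable(&proxy_exec_enabled);
	/*
	 * Then wait for CPUs still in the middle of a proxy handoff to
	 * drain, e.g. by polling per-cpu counters of in-flight proxy
	 * operations.
	 */
}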

Thanks.

-- 
tejun
Re: [RFC][PATCH] sched/ext: Split curr|donor references properly
Posted by Joel Fernandes 1 week, 2 days ago

> On Dec 6, 2025, at 10:59 AM, Tejun Heo <tj@kernel.org> wrote:
> 
> Hello,
> 
> On Sat, Dec 06, 2025 at 09:56:50AM -0500, Joel Fernandes wrote:
> ...
>> At a high level, my understanding is BPF schedulers have a lot of say in how to
>> schedule including precise time slice and preemption control (give or take level
>> of control and performance reasons). You can in fact have your own 'userland'
>> queues that the kernel is unaware of, IIUC. I am not sure if proxy exec will
>> transparently work for all those usecases. It will probably work properly only
>> when BPF scheduling in userland is simple and most of the scheduling is done by
>> non-BPF kernel code.
> 
> Maybe this can be resolved by proxy execution explicitly telling sched_ext
> to essentially dequeue the proxy-executed task so that it's kept more
> transparent.

Yes, perhaps.

> However, I wonder whether it'd be a useful first step to first
> deconflict the two config options, i.e. allow sched_ext loading to disable proxy
> execution dynamically so that people don't have to choose between the two
> config options. I don't know the code intimately, but, from just skimming
> it, it looks like it can be drained with a static key and some percpu
> counters.

Agreed. There is also stop_machine if all else fails :)

thanks,

 - Joel
