[PATCH] sched_ext: Rebuild fair weight when disabling BPF scheduler

Zicheng Qu posted 1 patch 1 week, 6 days ago
[PATCH] sched_ext: Rebuild fair weight when disabling BPF scheduler
Posted by Zicheng Qu 1 week, 6 days ago
From: Zicheng Qu <quzicheng@huawei.com>

When a BPF scheduler is disabled, scx_root_disable() switches tasks
from ext_sched_class back to fair_sched_class directly. This does not
go through __setscheduler_params(), so p->se.load is not rebuilt for
tasks returning to fair.

For example, after enabling a sched_ext BPF scheduler and creating
CPU-bound tasks with different nice values, disabling the BPF scheduler
can leave them running under fair with stale p->se.load. They may then
split CPU time according to the stale weight instead of their current
nice weights.

Rebuild the fair load weight when scx_root_disable() switches a task
from ext_sched_class to fair_sched_class. Use set_load_weight(p, false)
so CFS gets a native load_weight derived from the task's current
policy/static_prio before the task is enqueued on fair.

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
 kernel/sched/ext.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 65631e577ee9..e5b8509ce7ee 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5967,6 +5967,10 @@ static void scx_root_disable(struct scx_sched *sch)
 
 		scoped_guard (sched_change, p, queue_flags) {
 			p->sched_class = new_class;
+
+			if (old_class == &ext_sched_class &&
+			    new_class == &fair_sched_class)
+				set_load_weight(p, false);
 		}
 
 		scx_disable_and_exit_task(scx_task_sched(p), p);
-- 
2.43.0
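
As a rough illustration of the symptom described in the changelog above (not
part of the patch), the sketch below computes how two CPU-bound tasks sharing
one CPU would split time under fair for a given pair of load weights. The
table holds the values of the kernel's sched_prio_to_weight[]; the helper
names weight_of_nice() and share_permille() are hypothetical. The "stale"
case assumes both entities still carry the nice-0 weight, which is only one
possible stale state.

```c
#include <assert.h>

/* Nice-to-weight table, with the values of the kernel's sched_prio_to_weight[]. */
static const int nice_to_weight[40] = {
	/* -20 */ 88761, 71755, 56483, 46273, 36291,
	/* -15 */ 29154, 23254, 18705, 14949, 11916,
	/* -10 */  9548,  7620,  6100,  4904,  3906,
	/*  -5 */  3121,  2501,  1991,  1586,  1277,
	/*   0 */  1024,   820,   655,   526,   423,
	/*   5 */   335,   272,   215,   172,   137,
	/*  10 */   110,    87,    70,    56,    45,
	/*  15 */    36,    29,    23,    18,    15,
};

/* Weight for a nice value in [-20, 19] (hypothetical helper). */
static int weight_of_nice(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return nice_to_weight[nice + 20];
}

/*
 * CPU share, in permille, of a task with weight w when the runnable tasks
 * on the CPU have a combined weight of total (integer division, rounds down).
 */
static int share_permille(int w, int total)
{
	return (int)((long)w * 1000 / total);
}
```

For a nice 0 task and a nice 10 task, the current weights are 1024 and 110,
so share_permille(1024, 1024 + 110) gives roughly 90% to the nice 0 task. If
both entities still carried a weight of 1024, share_permille(1024, 2048)
would give each task 50%.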
Re: [PATCH] sched_ext: Rebuild fair weight when disabling BPF scheduler
Posted by Andrea Righi 1 week, 6 days ago
Hi Zicheng,

On Tue, May 26, 2026 at 09:52:11PM +0800, Zicheng Qu wrote:
> From: Zicheng Qu <quzicheng@huawei.com>
> 
> When a BPF scheduler is disabled, scx_root_disable() switches tasks
> from ext_sched_class back to fair_sched_class directly. This does not
> go through __setscheduler_params(), so p->se.load is not rebuilt for
> tasks returning to fair.
> 
> For example, after enabling a sched_ext BPF scheduler and creating
> CPU-bound tasks with different nice values, disabling the BPF scheduler
> can leave them running under fair with stale p->se.load. They may then
> split CPU time according to the stale weight instead of their current
> nice weights.
> 
> Rebuild the fair load weight when scx_root_disable() switches a task
> from ext_sched_class to fair_sched_class. Use set_load_weight(p, false)
> so CFS gets a native load_weight derived from the task's current
> policy/static_prio before the task is enqueued on fair.
> 
> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> ---
>  kernel/sched/ext.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 65631e577ee9..e5b8509ce7ee 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5967,6 +5967,10 @@ static void scx_root_disable(struct scx_sched *sch)
>  
>  		scoped_guard (sched_change, p, queue_flags) {
>  			p->sched_class = new_class;
> +
> +			if (old_class == &ext_sched_class &&
> +			    new_class == &fair_sched_class)
> +				set_load_weight(p, false);

I'm wondering if we have a similar issue for tasks moving from SCHED_EXT to
SCHED_NORMAL when a scx scheduler is running in partial mode. Maybe we need to
intercept this special case in __sched_setscheduler()? (not necessarily for this
patch, it can be addressed later as a separate follow-up patch).

For now, this makes sense to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

>  		}
>  
>  		scx_disable_and_exit_task(scx_task_sched(p), p);
> -- 
> 2.43.0
>
[PATCH v2] sched_ext: Rebuild fair weight on ext to fair switches
Posted by quzicheng315@gmail.com 1 week, 5 days ago
From: Zicheng Qu <quzicheng315@gmail.com>

Tasks running on sched_ext do not use p->se.load as their active
scheduling weight. Their nice-derived weight is maintained as
p->scx.weight instead.

When such a task switches back to fair, CFS expects p->se.load to match
the task's current policy/static_prio before the task is enqueued.
However, not all ext to fair transitions rebuild p->se.load. For
example, scx_root_disable() switches tasks back to fair directly, and
partial mode can move a task from SCHED_EXT to SCHED_NORMAL through
sched_setscheduler(). In the latter case, set_load_weight(p, true) runs
while p->sched_class is still ext_sched_class, so reweight_task_scx()
updates p->scx.weight but leaves p->se.load stale.

Rebuild the fair load weight in sched_change_end() when the class switch
is from ext_sched_class to fair_sched_class. This is after the class has
been changed and before the task is enqueued on fair, so CFS sees a
native load_weight derived from the task's current policy/static_prio.

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
Changes in v2:
- Move the fix from scx_root_disable() to sched_change_end() so the same
  ext-to-fair rebuild also covers partial mode SCHED_EXT to SCHED_NORMAL
  transitions through sched_setscheduler(), as Andrea pointed out.

 kernel/sched/core.c |  2 ++
 kernel/sched/ext.h  | 11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c6..c694aabc451a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11200,6 +11200,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
 	 */
 	WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
 
+	scx_rebuild_fair_weight_on_class_switch(p, ctx->class, p->sched_class);
+
 	if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
 		p->sched_class->switching_to(rq, p);
 
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index 0b7fc46aee08..1f8248c897af 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -35,6 +35,14 @@ static inline bool task_on_scx(const struct task_struct *p)
 	return scx_enabled() && p->sched_class == &ext_sched_class;
 }
 
+static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
+							   const struct sched_class *old_class,
+							   const struct sched_class *new_class)
+{
+	if (old_class == &ext_sched_class && new_class == &fair_sched_class)
+		set_load_weight(p, false);
+}
+
 #ifdef CONFIG_SCHED_CORE
 bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 		   bool in_fi);
@@ -55,6 +63,9 @@ static inline int scx_check_setscheduler(struct task_struct *p, int policy) { re
 static inline bool task_on_scx(const struct task_struct *p) { return false; }
 static inline bool scx_allow_ttwu_queue(const struct task_struct *p) { return true; }
 static inline void init_sched_ext_class(void) {}
+static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
+							   const struct sched_class *old_class,
+							   const struct sched_class *new_class) {}
 
 #endif	/* CONFIG_SCHED_CLASS_EXT */
 
-- 
2.43.0
Re: [PATCH v2] sched_ext: Rebuild fair weight on ext to fair switches
Posted by Peter Zijlstra 1 week, 5 days ago
On Wed, May 27, 2026 at 05:40:37PM +0800, quzicheng315@gmail.com wrote:
> From: Zicheng Qu <quzicheng315@gmail.com>
> 
> Tasks running on sched_ext do not use p->se.load as their active
> scheduling weight. Their nice-derived weight is maintained as
> p->scx.weight instead.
> 
> When such a task switches back to fair, CFS expects p->se.load to match
> the task's current policy/static_prio before the task is enqueued.
> However, not all ext to fair transitions rebuild p->se.load. For
> example, scx_root_disable() switches tasks back to fair directly, and
> partial mode can move a task from SCHED_EXT to SCHED_NORMAL through
> sched_setscheduler(). In the latter case, set_load_weight(p, true) runs
> while p->sched_class is still ext_sched_class, so reweight_task_scx()
> updates p->scx.weight but leaves p->se.load stale.
> 
> Rebuild the fair load weight in sched_change_end() when the class switch
> is from ext_sched_class to fair_sched_class. This is after the class has
> been changed and before the task is enqueued on fair, so CFS sees a
> native load_weight derived from the task's current policy/static_prio.
> 
> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> ---
> Changes in v2:
> - Move the fix from scx_root_disable() to sched_change_end() so the same
>   ext-to-fair rebuild also covers partial mode SCHED_EXT to SCHED_NORMAL
>   transitions through sched_setscheduler(), as Andrea pointed out.
> 
>  kernel/sched/core.c |  2 ++
>  kernel/sched/ext.h  | 11 +++++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d3c6..c694aabc451a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11200,6 +11200,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
>  	 */
>  	WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
>  
> +	scx_rebuild_fair_weight_on_class_switch(p, ctx->class, p->sched_class);
> +
>  	if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
>  		p->sched_class->switching_to(rq, p);
>  
> diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
> index 0b7fc46aee08..1f8248c897af 100644
> --- a/kernel/sched/ext.h
> +++ b/kernel/sched/ext.h
> @@ -35,6 +35,14 @@ static inline bool task_on_scx(const struct task_struct *p)
>  	return scx_enabled() && p->sched_class == &ext_sched_class;
>  }
>  
> +static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
> +							   const struct sched_class *old_class,
> +							   const struct sched_class *new_class)
> +{
> +	if (old_class == &ext_sched_class && new_class == &fair_sched_class)
> +		set_load_weight(p, false);
> +}
> +
>  #ifdef CONFIG_SCHED_CORE
>  bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>  		   bool in_fi);
> @@ -55,6 +63,9 @@ static inline int scx_check_setscheduler(struct task_struct *p, int policy) { re
>  static inline bool task_on_scx(const struct task_struct *p) { return false; }
>  static inline bool scx_allow_ttwu_queue(const struct task_struct *p) { return true; }
>  static inline void init_sched_ext_class(void) {}
> +static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
> +							   const struct sched_class *old_class,
> +							   const struct sched_class *new_class) {}
>  
>  #endif	/* CONFIG_SCHED_CLASS_EXT */

This is truly horrible. We have 4 class methods involved with switching
classes and you stick in a random call in a place that is called when no
class is changed.

Would not something like this work?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62a2dcb0d03e..a2eb43bd73b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14957,6 +14957,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	detach_task_cfs_rq(p);
 }
 
+static void switching_to_fair(struct rq *rq, struct task_struct *p)
+{
+	set_load_weight(p, false);
+}
+
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
 	WARN_ON_ONCE(p->se.sched_delayed);
@@ -15351,6 +15356,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.prio_changed		= prio_changed_fair,
 	.switching_from		= switching_from_fair,
 	.switched_from		= switched_from_fair,
+	.switching_to		= switching_to_fair,
 	.switched_to		= switched_to_fair,
 
 	.get_rr_interval	= get_rr_interval_fair,
Re: [PATCH v2] sched_ext: Rebuild fair weight on ext to fair switches
Posted by Zicheng Qu 1 week, 4 days ago
On Wed, May 27, 2026 at 07:26PM +0800, Peter Zijlstra wrote:

> This is truly horrible. We have 4 class methods involved with switching
> classes and you stick in a random call in a place that is called when no
> class is changed.
>
> Would not something like this work?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 62a2dcb0d03e..a2eb43bd73b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -14957,6 +14957,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>   	detach_task_cfs_rq(p);
>   }
>   
> +static void switching_to_fair(struct rq *rq, struct task_struct *p)
> +{
> +	set_load_weight(p, false);
> +}
> +
>   static void switched_to_fair(struct rq *rq, struct task_struct *p)
>   {
>   	WARN_ON_ONCE(p->se.sched_delayed);
> @@ -15351,6 +15356,7 @@ DEFINE_SCHED_CLASS(fair) = {
>   	.prio_changed		= prio_changed_fair,
>   	.switching_from		= switching_from_fair,
>   	.switched_from		= switched_from_fair,
> +	.switching_to		= switching_to_fair,
>   	.switched_to		= switched_to_fair,
>   
>   	.get_rr_interval	= get_rr_interval_fair,

Yes, from the class switch point of view, `switching_to_fair()` is a better fit.

Before v2, I was weighing three possible places for the fix:

1. Updating `p->se.load` from `reweight_task_scx()`. This would keep the fair
weight in sync while the task is on sched_ext, so switching back to fair would
not need any extra fixup. However, it would also make sched_ext maintain fair
class state even when fair is not using it, which does not seem like the right
ownership model.

2. Rebuilding `p->se.load` from fair's `switching_to` hook. This is the most
natural place semantically, since the task is entering fair and fair prepares
its own state before enqueue. My only concern was that, for non-ext -> fair
paths, `__setscheduler_params()` may have already updated `p->se.load` through
`set_load_weight(p, true)`, so calling `set_load_weight(p, false)`
unconditionally here can be redundant logically. Functionally, though, it is
harmless.

3. Rebuilding in `sched_change_end()` based on the old/new classes. That was
the v2 choice because both classes are available there, the task has not been
enqueued yet, and it covers both `scx_root_disable()` and the partial-mode
`sched_setscheduler()` path. In hindsight, though, this makes the generic
sched_change path handle a scx & fair-specific fixup. That is more awkward
than letting fair prepare its own state in `switching_to_fair()`.
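
The ordering problem behind these options can be modelled in user space. The
sketch below is a simplified stand-in, not kernel code: struct task,
struct sched_class_model, set_load_weight_model() and renice_then_switch()
are hypothetical reductions of the real structures, the weight lookup covers
only three nice values, and the ext hook stores the raw weight where the
kernel rescales it. It only shows why a reweight that runs while the class is
still ext leaves se.load untouched, and how a `switching_to` hook on fair
repairs that before enqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task;

/* Reduced stand-in for struct sched_class: only the two hooks used here. */
struct sched_class_model {
	void (*reweight_task)(struct task *p, unsigned long weight);
	void (*switching_to)(struct task *p);
};

/* Reduced stand-in for task_struct. */
struct task {
	int nice;
	const struct sched_class_model *sched_class;
	unsigned long se_load;		/* models p->se.load.weight */
	unsigned long scx_weight;	/* models p->scx.weight */
};

/* Only the nice values used in this sketch; the kernel has a full table. */
static unsigned long weight_of_nice(int nice)
{
	switch (nice) {
	case 10: return 110;
	case 5:  return 335;
	default: return 1024;	/* treated as nice 0 */
	}
}

/* Same shape as set_load_weight(): delegate to the class hook or write se_load. */
static void set_load_weight_model(struct task *p, bool update_load)
{
	unsigned long w = weight_of_nice(p->nice);

	if (update_load && p->sched_class->reweight_task)
		p->sched_class->reweight_task(p, w);
	else
		p->se_load = w;
}

/* ext records only its own weight; se_load is left untouched. */
static void reweight_task_ext(struct task *p, unsigned long w)
{
	p->scx_weight = w;
}

/* The hook discussed in the thread: rebuild se_load on entry to fair. */
static void switching_to_fair(struct task *p)
{
	set_load_weight_model(p, false);
}

static const struct sched_class_model ext_class = { .reweight_task = reweight_task_ext };
static const struct sched_class_model fair_old = { .switching_to = NULL };
static const struct sched_class_model fair_new = { .switching_to = switching_to_fair };

/* Renice on ext, then switch to the given fair model; returns se_load at enqueue time. */
static unsigned long renice_then_switch(const struct sched_class_model *fair)
{
	struct task p = {
		.nice = 0, .sched_class = &ext_class,
		.se_load = 1024, .scx_weight = 1024,
	};

	p.nice = 10;
	set_load_weight_model(&p, true);	/* lands in reweight_task_ext() */
	p.sched_class = fair;
	if (fair->switching_to)
		fair->switching_to(&p);
	return p.se_load;
}
```

In this model, renice_then_switch(&fair_old) still returns the nice-0 weight
1024, while renice_then_switch(&fair_new) returns 110, the nice-10 weight.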

I'll respin v3 as you suggested.


Thanks,

Zicheng
Re: [PATCH v2] sched_ext: Rebuild fair weight on ext to fair switches
Posted by Peter Zijlstra 1 week, 4 days ago
On Thu, May 28, 2026 at 10:53:54AM +0800, Zicheng Qu wrote:

> 2. Rebuilding `p->se.load` from fair's `switching_to` hook. This is the most
> natural place semantically, since the task is entering fair and fair prepares
> its own state before enqueue. My only concern was that, for non-ext -> fair
> paths, `__setscheduler_params()` may have already updated `p->se.load` through
> `set_load_weight(p, true)`, so calling `set_load_weight(p, false)`
> unconditionally here can be redundant logically. Functionally, though, it is
> harmless.

Right. We can worry about optimizing this if there's ever a report. I
don't expect this to be noticeable much. If anything, the PI code would
be the one to trip this most often I think.
[PATCH v3] sched/fair: Rebuild load weight when switching to fair
Posted by quzicheng315@gmail.com 1 week, 4 days ago
From: Zicheng Qu <quzicheng@huawei.com>

Tasks that run outside fair may not keep p->se.load in sync with their
current scheduling policy and static priority. sched_ext, for example,
uses p->scx.weight as the active scheduling weight, so p->se.load can be
stale when a task moves back to fair.

The fair_sched_class expects the sched_entity load weight to be valid
before the task is enqueued. Rebuild it from fair's switching_to hook,
which runs after the class has been changed to fair and before enqueue,
so both sched_ext disable and SCHED_EXT to SCHED_NORMAL transitions get
a native fair load weight.

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
Changes in v3:
- Move the rebuild into fair's switching_to hook, as suggested by Peter.
  This lets fair prepare its own state before enqueue and avoids adding a
  sched_ext/fair-specific fixup to the generic sched_change_end() path.

Changes in v2:
- Move the fix from scx_root_disable() to the class switch path so it also
  covers partial-mode SCHED_EXT to SCHED_NORMAL transitions through
  sched_setscheduler(). Andrea identified this missing case in the v1
  discussion.

 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f982..3a21ceefcadf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13837,6 +13837,15 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	detach_task_cfs_rq(p);
 }
 
+static void switching_to_fair(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Tasks may come from classes that don't keep se.load up to date.
+	 * Rebuild it before the task is enqueued.
+	 */
+	set_load_weight(p, false);
+}
+
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
 	WARN_ON_ONCE(p->se.sched_delayed);
@@ -14233,6 +14242,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.prio_changed		= prio_changed_fair,
 	.switching_from		= switching_from_fair,
 	.switched_from		= switched_from_fair,
+	.switching_to		= switching_to_fair,
 	.switched_to		= switched_to_fair,
 
 	.get_rr_interval	= get_rr_interval_fair,
-- 
2.53.0
Re: [PATCH v3] sched/fair: Rebuild load weight when switching to fair
Posted by Tejun Heo 1 week, 4 days ago
On Thu, May 28, 2026 at 09:12:38PM +0800, quzicheng315@gmail.com wrote:
> From: Zicheng Qu <quzicheng@huawei.com>
> 
> Tasks that run outside fair may not keep p->se.load in sync with their
> current scheduling policy and static priority. sched_ext, for example,
> uses p->scx.weight as the active scheduling weight, so p->se.load can be
> stale when a task moves back to fair.
> 
> The fair_sched_class expects the sched_entity load weight to be valid
> before the task is enqueued. Rebuild it from fair's switching_to hook,
> which runs after the class has been changed to fair and before enqueue,
> so both sched_ext disable and SCHED_EXT to SCHED_NORMAL transitions get
> a native fair load weight.
> 
> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun