kernel/sched/ext.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-)
scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
clear [__]scx_switching_all. task_should_scx() short-circuits on DISABLING,
so forks in that window land on fair while next_active_class() still skips
fair - the new tasks stall.
This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
under scx_enable_mutex and creates a helper kthread; if that new kthread is
one of the stalled fair tasks, the mutex holder waits forever and
scx_root_disable() can never make progress. Only sub-sched support exposes
this, since sub-sched enables are the only path where
scx_alloc_and_add_sched() can race the root's disable.
Move the DISABLING check after @scx_switching_all so that whenever
@scx_switching_all is set, forks keep going to scx and stay in lockstep with
__scx_switched_all. Once both are cleared (together under the mutex),
DISABLING applies normally.
Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5092,10 +5092,31 @@ static const struct kset_uevent_ops scx_
*/
bool task_should_scx(int policy)
{
- if (!scx_enabled() || unlikely(scx_enable_state() == SCX_DISABLING))
+ /* if disabled, nothing should be on it */
+ if (!scx_enabled())
return false;
+
+ /* scx is taking over all SCHED_OTHER and SCHED_EXT tasks */
if (READ_ONCE(scx_switching_all))
return true;
+
+ /*
+ * scx is tearing down - keep new SCHED_EXT tasks out.
+ *
+ * Must come after scx_switching_all test. While both are set, we must
+ * return true via the branch above: [__]scx_switching_all are cleared
+ * together under scx_enable_mutex, and a fork routed to fair while
+ * __scx_switched_all is still on would stall because
+ * next_active_class() skips fair.
+ *
+ * This can develop into a deadlock - scx holds scx_enable_mutex across
+ * kthread_create() in scx_alloc_and_add_sched(); if the new kthread is
+ * the stalled task, the disable path can never grab the mutex to clear
+ * scx_switching_all.
+ */
+ if (unlikely(scx_enable_state() == SCX_DISABLING))
+ return false;
+
return policy == SCHED_EXT;
}
Hi Tejun,
On Sat, May 16, 2026 at 02:41:20PM -1000, Tejun Heo wrote:
> scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
> clear [__]scx_switching_all. task_should_scx() short-circuits on DISABLING,
> so forks in that window land on fair while next_active_class() still skips
> fair - the new tasks stall.
>
> This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
> under scx_enable_mutex and creates a helper kthread; if that new kthread is
> one of the stalled fair tasks, the mutex holder waits forever and
> scx_root_disable() can never make progress. Only sub-sched support exposes
> this, since sub-sched enables are the only path where
> scx_alloc_and_add_sched() can race the root's disable.
>
> Move the DISABLING check after @scx_switching_all so that whenever
> @scx_switching_all is set, forks keep going to scx and stay in lockstep with
> __scx_switched_all. Once both are cleared (together under the mutex),
> DISABLING applies normally.
>
> Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 23 ++++++++++++++++++++++-
> 1 file changed, 22 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5092,10 +5092,31 @@ static const struct kset_uevent_ops scx_
> */
> bool task_should_scx(int policy)
> {
> - if (!scx_enabled() || unlikely(scx_enable_state() == SCX_DISABLING))
> + /* if disabled, nothing should be on it */
> + if (!scx_enabled())
> return false;
> +
> + /* scx is taking over all SCHED_OTHER and SCHED_EXT tasks */
> if (READ_ONCE(scx_switching_all))
> return true;
> +
> + /*
> + * scx is tearing down - keep new SCHED_EXT tasks out.
> + *
> + * Must come after scx_switching_all test. While both are set, we must
> + * return true via the branch above: [__]scx_switching_all are cleared
> + * together under scx_enable_mutex, and a fork routed to fair while
> + * __scx_switched_all is still on would stall because
> + * next_active_class() skips fair.
Just being extra picky: [__]scx_switching_all are cleared together sequentially,
but not atomically (in fact the order is what matters). To make it more clear,
how about rephrasing the comment block above like this:
* Must come after the scx_switching_all test. scx_root_disable()
* clears __scx_switched_all before scx_switching_all (both under
* scx_enable_mutex), so while scx_switching_all is observed as true,
* __scx_switched_all may still be on. A fork routed to fair in that
* window would stall because next_active_class() skips fair.
> + *
> + * This can develop into a deadlock - scx holds scx_enable_mutex across
> + * kthread_create() in scx_alloc_and_add_sched(); if the new kthread is
> + * the stalled task, the disable path can never grab the mutex to clear
> + * scx_switching_all.
> + */
> + if (unlikely(scx_enable_state() == SCX_DISABLING))
> + return false;
> +
> return policy == SCHED_EXT;
> }
>
Other than that, looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
Hello, On Sun, May 17, 2026 at 12:56:34PM +0200, Andrea Righi wrote: > > + * Must come after scx_switching_all test. While both are set, we must > > + * return true via the branch above: [__]scx_switching_all are cleared > > + * together under scx_enable_mutex, and a fork routed to fair while > > + * __scx_switched_all is still on would stall because > > + * next_active_class() skips fair. > > Just being extra picky: [__]scx_switching_all are cleared together sequentially, > but not atomically (in fact the order is what matters). To make it more clear, > how about rephrasing the comment block above like this: > > * Must come after the scx_switching_all test. scx_root_disable() > * clears __scx_switched_all before scx_switching_all (both under > * scx_enable_mutex), so while scx_switching_all is observed as true, > * __scx_switched_all may still be on. A fork routed to fair in that > * window would stall because next_active_class() skips fair. Hmm... I don't think the ordering between scx_switching_all and __scx_switching_all matters here. The stall is caused by the gap between the earlier DISABLING transition and __scx_switching_all being turned off which here is tested through scx_switching_all and at this point as the mutex is already held, even if you swapped scx_switching_all's position with __scx_switching_all, it wouldn't matter. It's just kinda confusing because what's actually involved in the stall and deadlock is __scx_switching_all but we're testing it via scx_switching_all. I'll update the comment so that it just mentions __scx_switching_all. I'm not even sure we actually need scx_switching_all. Thanks. -- tejun
scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING,
so forks in that window land on fair while next_active_class() still skips
fair - the new tasks stall.
This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
under scx_enable_mutex and creates a helper kthread; if that new kthread is
one of the stalled fair tasks, the mutex holder waits forever and
scx_root_disable() can never make progress. Only sub-sched support exposes
this, since sub-sched enables are the only path where
scx_alloc_and_add_sched() can race the root's disable.
Move the DISABLING check after @scx_switching_all. @scx_switching_all
serves as a proxy for __scx_switched_all, so while it's set, forks keep
going to scx. Once cleared, DISABLING applies normally.
v2: Reword in-source comment and description. (Andrea)
Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5092,10 +5092,30 @@ static const struct kset_uevent_ops scx_
*/
bool task_should_scx(int policy)
{
- if (!scx_enabled() || unlikely(scx_enable_state() == SCX_DISABLING))
+ /* if disabled, nothing should be on it */
+ if (!scx_enabled())
return false;
+
+ /* scx is taking over all SCHED_OTHER and SCHED_EXT tasks */
if (READ_ONCE(scx_switching_all))
return true;
+
+ /*
+ * scx is tearing down - keep new SCHED_EXT tasks out.
+ *
+ * Must come after scx_switching_all test, which serves as a proxy
+ * for __scx_switched_all. While __scx_switched_all is set, we must
+ * return true via the branch above: a fork routed to fair would
+ * stall because next_active_class() skips fair.
+ *
+ * This can develop into a deadlock - scx holds scx_enable_mutex across
+ * kthread_create() in scx_alloc_and_add_sched(); if the new kthread is
+ * the stalled task, the disable path can never grab the mutex to clear
+ * scx_switching_all.
+ */
+ if (unlikely(scx_enable_state() == SCX_DISABLING))
+ return false;
+
return policy == SCHED_EXT;
}
Hello, Applied to sched_ext/for-7.1-fixes. Thanks. -- tejun
Hi Tejun,
On Sun, May 17, 2026 at 07:43:16AM -1000, Tejun Heo wrote:
> scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to
> clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING,
> so forks in that window land on fair while next_active_class() still skips
> fair - the new tasks stall.
>
> This can deadlock the disable path itself: scx_alloc_and_add_sched() runs
> under scx_enable_mutex and creates a helper kthread; if that new kthread is
> one of the stalled fair tasks, the mutex holder waits forever and
> scx_root_disable() can never make progress. Only sub-sched support exposes
> this, since sub-sched enables are the only path where
> scx_alloc_and_add_sched() can race the root's disable.
>
> Move the DISABLING check after @scx_switching_all. @scx_switching_all
> serves as a proxy for __scx_switched_all, so while it's set, forks keep
> going to scx. Once cleared, DISABLING applies normally.
>
> v2: Reword in-source comment and description. (Andrea)
>
> Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5092,10 +5092,30 @@ static const struct kset_uevent_ops scx_
> */
> bool task_should_scx(int policy)
> {
> - if (!scx_enabled() || unlikely(scx_enable_state() == SCX_DISABLING))
> + /* if disabled, nothing should be on it */
> + if (!scx_enabled())
> return false;
> +
> + /* scx is taking over all SCHED_OTHER and SCHED_EXT tasks */
> if (READ_ONCE(scx_switching_all))
> return true;
> +
> + /*
> + * scx is tearing down - keep new SCHED_EXT tasks out.
> + *
> + * Must come after scx_switching_all test, which serves as a proxy
> + * for __scx_switched_all. While __scx_switched_all is set, we must
> + * return true via the branch above: a fork routed to fair would
> + * stall because next_active_class() skips fair.
> + *
> + * This can develop into a deadlock - scx holds scx_enable_mutex across
> + * kthread_create() in scx_alloc_and_add_sched(); if the new kthread is
> + * the stalled task, the disable path can never grab the mutex to clear
> + * scx_switching_all.
> + */
Yeah, this is much better than my comment (that was quite confusing).
To make sure I understand: what fixes the deadlock is checking scx_switching_all
before DISABLING in task_should_scx(), because in this way the sched_ext_helper
kthread goes to scx (not fair), runs, the enable path completes, releases the
mutex and the disable path moves forward.
When I wrote my comment I was looking at the ordering of [__]scx_switched_all in
scx_root_disable():
static_branch_disable(&__scx_switched_all);
WRITE_ONCE(scx_switching_all, false);
And I was wondering, if we invert those we'd have a similar issue: a small
window where __scx_switched_all == ON and scx_switching_all == false. But the
current order is already the safe one, so no change needed.
Thanks,
-Andrea
> + if (unlikely(scx_enable_state() == SCX_DISABLING))
> + return false;
> +
> return policy == SCHED_EXT;
> }
>
Hello, On Sun, May 17, 2026 at 08:47:31PM +0200, Andrea Righi wrote: ... > Yeah, this is much better than my comment (that was quite confusing). > > To make sure I understand: what fixes the deadlock is checking scx_switching_all > before DISABLING in task_should_scx(), because in this way the sched_ext_helper > kthread goes to scx (not fair), runs, the enable path completes, releases the > mutex and the disable path moves forward. > > When I wrote my comment I was looking at the ordering of [__]scx_switched_all in > scx_root_disable(): > > static_branch_disable(&__scx_switched_all); > WRITE_ONCE(scx_switching_all, false); > > And I was wondering, if we invert those we'd have a similar issue: a small > window where __scx_switched_all == ON and scx_switching_all == false. But the > current order is already the safe one, so no change needed. Yeah, and even if create that window between __scx_switched_all and scx_switching_all, it's transient. Let's say a task slips into eevdf between the two. The task has no way of preventing disable from completing __scx_switched_all transition, and the condition would unwind. The problem with DISABLING transition was that it could make a racing enable path to wait for kthread creation to finish while holding enable_mutex. Because disable path needs the same mutex to turn off __scx_switched_all and the stalled task needs __scx_switched_all to be turned off to progress, we end up in a deadlock. Thanks. -- tejun
© 2016 - 2026 Red Hat, Inc.