[PATCH] sched_ext: allow scx_bpf_task_cgroup() in ops.dispatch()
Posted by Changwoo Min 2 weeks, 3 days ago
Calling scx_bpf_task_cgroup() on the prev task from ops.dispatch()
currently triggers a runtime error:

------
CodecWorker[92989] triggered exit kind 1024:
  runtime error (called on a task not being operated on)

Backtrace:
  scx_bpf_task_cgroup+0xbb/0xd0
  bpf_prog_d7d2e9d404cf3602_lavd_dispatch+0xc45/0xccc
  bpf__sched_ext_ops_dispatch+0x4b/0xab
  balance_one+0x25f/0x550
  balance_scx+0x37/0x160
  prev_balance+0x46/0xb0
  __schedule+0x702/0x23d0
  schedule+0x27/0xd0
  irqentry_exit_to_user_mode+0x1ac/0x200
  asm_sysvec_apic_timer_interrupt+0x1a/0x20
------

This happens because the prev task is not registered as a task argument
of ops.dispatch(). As a result, scx_kf_allowed_on_arg_tasks() rejects
scx_bpf_task_cgroup() calls on the prev task.

Fix this by adding the prev task to scx.kf_tasks so that task-related
BPF helpers such as scx_bpf_task_cgroup() can be called safely. Since
the SCX_CALL_OP_TASK family assumes the first argument is the task,
introduce a new SCX_CALL_OP_TASK_ANY macro without that restriction.
Also update __SCX_KF_TERMINAL to include SCX_KF_DISPATCH.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 include/linux/sched/ext.h |  3 ++-
 kernel/sched/ext.c        | 13 +++++++++++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 7047101dbf58..b404a93d371c 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -116,7 +116,8 @@ enum scx_kf_mask {
 
 	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
 				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
-	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
+	__SCX_KF_TERMINAL	= SCX_KF_DISPATCH | SCX_KF_ENQUEUE |
+				  SCX_KF_SELECT_CPU | SCX_KF_REST,
 };
 
 enum scx_dsq_lnode_flags {
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3eb6be889da6..f704d2e5dce2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -328,6 +328,14 @@ do {										\
 	__ret;									\
 })
 
+#define SCX_CALL_OP_TASK_ANY(sch, mask, op, rq, task, args...)			\
+do {										\
+	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
+	current->scx.kf_tasks[0] = task;					\
+	SCX_CALL_OP((sch), mask, op, rq, ##args);				\
+	current->scx.kf_tasks[0] = NULL;					\
+} while (0)
+
 /* @mask is constant, always inline to cull unnecessary branches */
 static __always_inline bool scx_kf_allowed(u32 mask)
 {
@@ -2105,8 +2113,9 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	do {
 		dspc->nr_tasks = 0;
 
-		SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq,
-			    cpu_of(rq), prev_on_scx ? prev : NULL);
+		SCX_CALL_OP_TASK_ANY(sch, SCX_KF_DISPATCH, dispatch, rq,
+				     prev_on_scx ? prev : NULL,
+				     cpu_of(rq), prev_on_scx ? prev : NULL);
 
 		flush_dispatch_buf(sch, rq);
 
-- 
2.51.0
Re: [PATCH] sched_ext: allow scx_bpf_task_cgroup() in ops.dispatch()
Posted by Tejun Heo 2 weeks, 1 day ago
Hello,

On Mon, Sep 15, 2025 at 03:52:36PM +0900, Changwoo Min wrote:
...
> Fix this by adding the prev task to scx.kf_tasks so that task-related
> BPF helpers such as scx_bpf_task_cgroup() can be called safely. Since
> the SCX_CALL_OP_TASK family assumes the first argument is the task,
> introduce a new SCX_CALL_OP_TASK_ANY macro without that restriction.
> Also update __SCX_KF_TERMINAL to include SCX_KF_DISPATCH.

I'm not sure this is safe tho. ops.dispatch() can release the rq lock it's
holding to migrate tasks across rq's, which means that other operations can
nest inside - ie. there can be an irq which triggers ops.enqueue() while
ops.dispatch() is in progress. That can in turn overwrite
current->scx.kf_tasks[].

I wonder whether a better approach would be tracking cgroup membership from
BPF side. ops.init_task() tells you the initial cgroup it's joining and if
the task later moves to another cgroup, ops.cgroup_move() will be invoked.
Would that work?

Thanks.

-- 
tejun
Re: [PATCH] sched_ext: allow scx_bpf_task_cgroup() in ops.dispatch()
Posted by Changwoo Min 2 weeks, 1 day ago
Hello Tejun,

Thanks for the feedback.

On 9/17/25 05:32, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 15, 2025 at 03:52:36PM +0900, Changwoo Min wrote:
> ...
>> Fix this by adding the prev task to scx.kf_tasks so that task-related
>> BPF helpers such as scx_bpf_task_cgroup() can be called safely. Since
>> the SCX_CALL_OP_TASK family assumes the first argument is the task,
>> introduce a new SCX_CALL_OP_TASK_ANY macro without that restriction.
>> Also update __SCX_KF_TERMINAL to include SCX_KF_DISPATCH.
> 
> I'm not sure this is safe tho. ops.dispatch() can release the rq lock it's
> holding to migrate tasks across rq's, which means that other operations can
> nest inside - ie. there can be an irq which triggers ops.enqueue() while
> ops.dispatch() is in progress. That can in turn overwrite
> current->scx.kf_tasks[].

I thought that ops.dispatch() always holds an rq lock since there
is a lockdep_assert_rq_held(rq) check at the beginning of
balance_one(), which invokes the BPF scheduler's dispatch.
I guess I am missing an edge case?

> 
> I wonder whether a better approach would be tracking cgroup membership from
> BPF side. ops.init_task() tells you the initial cgroup it's joining and if

Currently, it is also not allowed to call scx_bpf_task_cgroup()
at ops.init_task() because the rq lock is not held at
ops.init_task(). The earliest possible moment to get a task's
cgroup ID is ops.enable().

> the task later moves to another cgroup, ops.cgroup_move() will be invoked.
> Would that work?

Checking the cgroup ID at ops.enable() is not ideal because a task's
cgroup rarely changes between ops.enable() calls, so most of those
lookups would be redundant. However, the additional overhead should be
marginal, so tracking the cgroup ID at ops.enable() and
ops.cgroup_move() will work.

Regards,
Changwoo Min
Re: [PATCH] sched_ext: allow scx_bpf_task_cgroup() in ops.dispatch()
Posted by Tejun Heo 2 weeks, 1 day ago
Hello,

On Wed, Sep 17, 2025 at 11:12:18AM +0900, Changwoo Min wrote:
> > I'm not sure this is safe tho. ops.dispatch() can release the rq lock it's
> > holding to migrate tasks across rq's, which means that other operations can
> > nest inside - ie. there can be an irq which triggers ops.enqueue() while
> > ops.dispatch() is in progress. That can in turn overwrite
> > current->scx.kf_tasks[].
> 
> I thought that ops.dispatch() always holds an rq lock since there
> is a lockdep_assert_rq_held(rq) check at the beginning of
> balance_one(), which invokes the BPF scheduler's dispatch.
> I guess I am missing an edge case?

ops.dispatch() is always called and returns with rq locked; however, it
sometimes needs to migrate tasks across rq boundaries and thus can
temporarily release the rq lock it was called with. Please take a look at
dispatch_to_local_dsq() which is called from balance_one() ->
flush_dispatch_buf() -> finish_dispatch().

> > I wonder whether a better approach would be tracking cgroup membership from
> > BPF side. ops.init_task() tells you the initial cgroup it's joining and if
> 
> Currently, it is also not allowed to call scx_bpf_task_cgroup()
> at ops.init_task() because the rq lock is not held at
> ops.init_task(). The earliest possible moment to get a task's
> cgroup ID is ops.enable().

ops.init_task() is called with scx_init_task_args, which contains .cgroup
which points to the CPU controller cgroup the task belongs to.

Thanks.

-- 
tejun
Re: [PATCH] sched_ext: allow scx_bpf_task_cgroup() in ops.dispatch()
Posted by Changwoo Min 1 week, 5 days ago
Hello,

Thanks for the clarification.

On 9/17/25 14:00, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 17, 2025 at 11:12:18AM +0900, Changwoo Min wrote:
>>> I'm not sure this is safe tho. ops.dispatch() can release the rq lock it's
>>> holding to migrate tasks across rq's, which means that other operations can
>>> nest inside - ie. there can be an irq which triggers ops.enqueue() while
>>> ops.dispatch() is in progress. That can in turn overwrite
>>> current->scx.kf_tasks[].
>>
>> I thought that ops.dispatch() always holds an rq lock since there
>> is a lockdep_assert_rq_held(rq) check at the beginning of
>> balance_one(), which invokes the BPF scheduler's dispatch.
>> I guess I am missing an edge case?
> 
> ops.dispatch() is always called and returns with rq locked; however, it
> sometimes needs to migrate tasks across rq boundaries and thus can
> temporarily release the rq lock it was called with. Please take a look at
> dispatch_to_local_dsq() which is called from balance_one() ->
> flush_dispatch_buf() -> finish_dispatch().

You are right. I overlooked that path.

>>> I wonder whether a better approach would be tracking cgroup membership from
>>> BPF side. ops.init_task() tells you the initial cgroup it's joining and if
>>
>> Currently, it is also not allowed to call scx_bpf_task_cgroup()
>> at ops.init_task() because the rq lock is not held at
>> ops.init_task(). The earliest possible moment to get a task's
>> cgroup ID is ops.enable().
> 
> ops.init_task() is called with scx_init_task_args, which contains .cgroup
> which points to the CPU controller cgroup the task belongs to.

Yeah, that's better than relying on ops.enable().

Regards,
Changwoo Min
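
[Editor's note: the BPF-side tracking agreed on above could be sketched
roughly as below. This is an illustrative sketch, not the actual fix that
landed: the op names, the task_ctx/task_ctx_stor names, and the use of the
kernfs node id (cgrp->kn->id) as the cached "cgroup ID" are assumptions,
and field access such as args->cgroup should be verified against the
sched_ext headers of the running kernel.]

```c
/*
 * Sketch: cache the task's CPU-controller cgroup id in task local
 * storage at ops.init_task() time (scx_init_task_args carries the
 * initial cgroup) and refresh it from ops.cgroup_move(), so that
 * ops.dispatch() can read the cached id instead of calling
 * scx_bpf_task_cgroup() on a task it is not operating on.
 */
struct task_ctx {
	u64 cgrp_id;	/* cached CPU controller cgroup id */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_stor SEC(".maps");

s32 BPF_STRUCT_OPS(sketch_init_task, struct task_struct *p,
		   struct scx_init_task_args *args)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctx_stor, p, NULL,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!tctx)
		return -ENOMEM;

	/* args->cgroup is the cgroup the task is joining */
	tctx->cgrp_id = args->cgroup->kn->id;
	return 0;
}

void BPF_STRUCT_OPS(sketch_cgroup_move, struct task_struct *p,
		    struct cgroup *from, struct cgroup *to)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctx_stor, p, NULL, 0);
	if (tctx)
		tctx->cgrp_id = to->kn->id;
}
```

ops.dispatch() would then look up the prev task's task_ctx and read
cgrp_id directly, avoiding the scx_kf_allowed_on_arg_tasks() check
entirely.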