sched_ext: introduce cpu tick

[PATCH] sched_ext: introduce cpu tick

Posted by liuwenfang 8 months ago

Assume a CPU is running one RT task and has one runnable scx task on
its local dsq. The scx task cannot be scheduled until the RT task goes
to sleep. If the RT task runs for 100ms, the scx task should be migrated
to another dsq so that it has a chance to be scheduled by other CPUs.

Add a cpu_tick operation to notify the BPF scheduler so that it can check
for scx tasks that have been runnable for a long time on the local dsq
and apply a suitable policy to improve performance.

Signed-off-by: liuwenfang <liuwenfang@honor.com>
---
 kernel/sched/ext.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f5133249f..2232f616c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -337,6 +337,16 @@ struct sched_ext_ops {
 	 */
 	void (*tick)(struct task_struct *p);
 
+	/**
+	 * @cpu_tick: Periodic per-CPU tick
+	 * @rq: current CPU's rq
+	 *
+	 * This operation is called every 1/HZ seconds on each CPU that is
+	 * not idle. It notifies the BPF scheduler so that it can apply a
+	 * policy to runnable tasks on the local dsq.
+	 */
+	void (*cpu_tick)(struct rq *rq);
+
 	/**
 	 * @runnable: A task is becoming runnable on its associated CPU
 	 * @p: task becoming runnable
@@ -3569,6 +3579,9 @@ void scx_tick(struct rq *rq)
 	}
 
 	update_other_load_avgs(rq);
+
+	if (SCX_HAS_OP(cpu_tick))
+		SCX_CALL_OP(SCX_KF_REST, cpu_tick, rq);
 }
 
 static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
@@ -5753,6 +5766,7 @@ static void sched_ext_ops__enqueue(struct task_struct *p, u64 enq_flags) {}
 static void sched_ext_ops__dequeue(struct task_struct *p, u64 enq_flags) {}
 static void sched_ext_ops__dispatch(s32 prev_cpu, struct task_struct *prev__nullable) {}
 static void sched_ext_ops__tick(struct task_struct *p) {}
+static void sched_ext_ops__cpu_tick(struct rq *rq) {}
 static void sched_ext_ops__runnable(struct task_struct *p, u64 enq_flags) {}
 static void sched_ext_ops__running(struct task_struct *p) {}
 static void sched_ext_ops__stopping(struct task_struct *p, bool runnable) {}
@@ -5790,6 +5804,7 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
 	.dequeue		= sched_ext_ops__dequeue,
 	.dispatch		= sched_ext_ops__dispatch,
 	.tick			= sched_ext_ops__tick,
+	.cpu_tick		= sched_ext_ops__cpu_tick,
 	.runnable		= sched_ext_ops__runnable,
 	.running		= sched_ext_ops__running,
 	.stopping		= sched_ext_ops__stopping,
-- 
2.17.1
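
For illustration, the kind of check a BPF scheduler might run from a
cpu_tick-style callback can be modeled in plain userspace C. This is a toy
sketch, not kernel or BPF code: struct toy_cpu, toy_cpu_tick() and the 4 ms
threshold TOY_MAX_WAIT_NS are all hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state for a userspace toy model; not kernel code. */
struct toy_cpu {
	bool     has_waiter;		/* an scx task is queued on the local dsq */
	uint64_t waiter_since_ns;	/* when that task became runnable */
};

/* Assumed policy knob: flag the task after it has waited this long. */
#define TOY_MAX_WAIT_NS (4ULL * 1000 * 1000)	/* 4 ms */

/*
 * Models the body of a cpu_tick-style callback. Returns true when the
 * queued task has waited longer than TOY_MAX_WAIT_NS and should be moved
 * to another dsq.
 */
static bool toy_cpu_tick(const struct toy_cpu *cpu, uint64_t now_ns)
{
	/* Time must not run backwards relative to the enqueue timestamp. */
	assert(!cpu->has_waiter || now_ns >= cpu->waiter_since_ns);

	if (!cpu->has_waiter)
		return false;
	return now_ns - cpu->waiter_since_ns > TOY_MAX_WAIT_NS;
}
```

In this model the decision is only re-evaluated when a tick arrives, so a
waiting task is noticed at the earliest on the first tick after the
threshold has passed.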

Re: [PATCH] sched_ext: introduce cpu tick

Posted by Tejun Heo 8 months ago

Hello,

On Tue, Jun 10, 2025 at 08:59:45AM +0000, liuwenfang wrote:
> Assume one CPU is running one RT task and one runnable scx task on
> its local dsq, the scx task cannot be scheduled until RT task enters
> sleep, if RT task will run for 100ms, the scx task should be migrated
> to other dsqs, then it can have a chance to be scheduled by other CPUs.
> 
> So cpu_tick is added to notify BPF scheduler to check long runnable
> scx on its local dsq, related policy can be taken to improve the
> performance.

(cc'ing Kumar as we discussed a similar issue recently)

There are some race conditions we need to address but calling
scx_bpf_reenqueue_local() from ops.cpu_release() is the intended way of
handling these situations. I don't think periodically polling from ticks is
a good approach, especially given that ticks can be skipped w/ nohz_full.

Thanks.

-- 
tejun
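
The effect of re-enqueueing on CPU release can be pictured with a small
userspace model. This is only a sketch under stated assumptions: it does not
call scx_bpf_reenqueue_local() or implement ops.cpu_release(), and the toy_*
types and functions are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* Hypothetical toy types; none of these are kernel structures. */
struct toy_task {
	int pid;
};

struct toy_dsq {
	struct toy_task *tasks[TOY_MAX_TASKS];
	size_t nr;
};

/* Append a task to a queue; fails when the queue is full. */
static bool toy_dsq_push(struct toy_dsq *dsq, struct toy_task *p)
{
	if (dsq->nr >= TOY_MAX_TASKS)
		return false;
	dsq->tasks[dsq->nr++] = p;
	return true;
}

/*
 * Models what happens when a CPU is taken over by a higher-priority
 * scheduling class: tasks waiting on that CPU's local queue are moved,
 * in order, to a shared queue that other CPUs can consume from.
 * Tasks that do not fit stay on the local queue.
 * Returns the number of tasks moved.
 */
static size_t toy_cpu_release(struct toy_dsq *local, struct toy_dsq *shared)
{
	size_t moved = 0, kept = 0;

	assert(local != shared);

	for (size_t i = 0; i < local->nr; i++) {
		if (toy_dsq_push(shared, local->tasks[i]))
			moved++;
		else
			local->tasks[kept++] = local->tasks[i];
	}
	local->nr = kept;
	return moved;
}
```

The point of the model is that the move is triggered by an event (the CPU
being taken away) rather than by a periodic check.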

Re: [PATCH] sched_ext: introduce cpu tick

Posted by liuwenfang 8 months ago

Thanks for your feedback.

Another issue is that if a runnable SCX task on the local dsq has
p->nr_cpus_allowed equal to 1 and there are RT tasks on this CPU's
runqueue, the BPF scheduler needs a chance to adjust the RT throttling
parameters (or use some other method) so that the SCX task bound to this
CPU is scheduled in time. This is important in the mobile scenario, where
rendering must stay smooth at 120 frames per second.
scx_bpf_reenqueue_local() will not help this task, because with
p->nr_cpus_allowed == 1 it cannot run on any other CPU.

Some tradeoffs can also be made to balance performance. If the running SCX
task is preempted by a short-running RT task (as predicted from its
history), it is better for the BPF scheduler to keep the SCX task on its
local dsq rather than call scx_bpf_reenqueue_local() right away. However,
we still need protection in case the short RT task turns into a
long-running one (perhaps because of some exception).

Any suggestions and comments are welcome!

Best regards
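
The tradeoff described above (keep the task local when the RT task is
predicted to be short, but guard against a misprediction) can be sketched as
a small userspace model. Everything here is an assumption made for the
illustration: the toy_* names, the 1 ms cutoff, and the choice of a simple
running average as the history are hypothetical and not taken from any
existing scheduler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-RT-task history for a userspace toy model. */
struct toy_rt_hist {
	uint64_t avg_run_ns;	/* running average of past run lengths */
};

/* Assumed cutoff below which an RT run counts as "short": 1 ms. */
#define TOY_SHORT_RUN_NS (1ULL * 1000 * 1000)

enum toy_action { TOY_KEEP_LOCAL, TOY_REENQUEUE };

/* Fold one observed run length into the average (new sample weight 1/4). */
static void toy_rt_hist_update(struct toy_rt_hist *h, uint64_t run_ns)
{
	assert(h);
	h->avg_run_ns = (3 * h->avg_run_ns + run_ns) / 4;
}

/*
 * Decide what to do with an scx task preempted by an RT task.
 * @elapsed_ns: how long the RT task has already been running this time.
 */
static enum toy_action toy_decide(const struct toy_rt_hist *h,
				  uint64_t elapsed_ns)
{
	assert(h);

	/* Guard: the RT task has outrun the cutoff, whatever its history. */
	if (elapsed_ns > TOY_SHORT_RUN_NS)
		return TOY_REENQUEUE;

	/* Prediction: history says this RT task usually finishes quickly. */
	if (h->avg_run_ns <= TOY_SHORT_RUN_NS)
		return TOY_KEEP_LOCAL;

	return TOY_REENQUEUE;
}
```

Something still has to call toy_decide() again while the RT task keeps
running for the guard to take effect; the model leaves open what triggers
that call. It also does not cover the p->nr_cpus_allowed == 1 case, where
re-enqueueing cannot move the task to another CPU.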


Re: [PATCH] sched_ext: introduce cpu tick

Posted by Andrea Righi 8 months ago

On Wed, Jun 11, 2025 at 02:22:11AM +0000, liuwenfang wrote:
> Thanks for your feedback.
> 
> Another issue is that if a runnable local SCX task has p->nr_cpus_allowed equal to 1,
> and there are RT tasks on this CPU's runqueue, we need a chance to let the BPF scheduler adjust the RT
> throttle param properly (or other methods), so that the locally bound SCX task will be scheduled
> in time. This is important for the mobile scenario to render smoothly at 120 frames per second.
> scx_bpf_reenqueue_local will not work for the local SCX task when p->nr_cpus_allowed == 1.
> 
> Also some tradeoff methods can be taken to balance the performance:
> If the running SCX task is preempted by one short-running RT task (predicted by its history),
> then it is better for the BPF scheduler to keep this SCX task on its local dsq, rather than directly calling
> scx_bpf_reenqueue_local(). However, we still need protection for this situation in case the
> short RT task becomes a long-running task (perhaps due to some exception).
> 
> Any suggestions and comments are welcome!

This will all be addressed by the DL server work that Joel is doing:
https://lore.kernel.org/all/20250602180110.816225-10-joelagnelf@nvidia.com/

Thanks,
-Andrea
