sched_ext: Support high-performance monotonically non-decreasing clock

[PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Changwoo Min 5 days, 23 hours ago

An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
after long running operations -- ops.running() and ops.update_idle() --
in a BPF scheduler, the sched_ext core invalidates the rq clock.

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/core.c  | 6 +++++-
 kernel/sched/ext.c   | 3 +++
 kernel/sched/sched.h | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a910a5b4c274..d0eb58b6a2da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
+	u64 clock;
 
 	lockdep_assert_rq_held(rq);
 
@@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
 		SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
 	rq->clock_update_flags |= RQCF_UPDATED;
 #endif
+	clock = sched_clock_cpu(cpu_of(rq));
+	scx_rq_clock_update(rq, clock);
 
-	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+	delta = clock - rq->clock;
 	if (delta < 0)
 		return;
 	rq->clock += delta;
+
 	update_rq_clock_task(rq, delta);
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9f9bc2930658..b8ad776ef516 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2928,6 +2928,8 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 	if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
 		SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
 
+	scx_rq_clock_stale(rq);
+
 	clr_task_runnable(p, true);
 
 	/*
@@ -3590,6 +3592,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
 {
 	int cpu = cpu_of(rq);
 
+	scx_rq_clock_stale(rq);
 	if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
 		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
 		if (!static_branch_unlikely(&scx_builtin_idle_enabled))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 61efff790e24..03854ac9914b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1758,7 +1758,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 	if (rq->clock_update_flags > RQCF_ACT_SKIP)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
-
+	scx_rq_clock_stale(rq);
 	lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
 }
 
-- 
2.47.0

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Peter Zijlstra 5 days, 20 hours ago

On Sun, Nov 17, 2024 at 01:01:23AM +0900, Changwoo Min wrote:
> An rq clock becomes valid when it is updated using update_rq_clock()
> and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
> after long running operations -- ops.running() and ops.update_idle() --
> in a BPF scheduler, the sched_ext core invalidates the rq clock.
> 
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/sched/core.c  | 6 +++++-
>  kernel/sched/ext.c   | 3 +++
>  kernel/sched/sched.h | 2 +-
>  3 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a910a5b4c274..d0eb58b6a2da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  void update_rq_clock(struct rq *rq)
>  {
>  	s64 delta;
> +	u64 clock;
>  
>  	lockdep_assert_rq_held(rq);
>  
> @@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
>  		SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
>  	rq->clock_update_flags |= RQCF_UPDATED;
>  #endif
> +	clock = sched_clock_cpu(cpu_of(rq));
> +	scx_rq_clock_update(rq, clock);

It is not at all clear why you think you need to keep a second copy of
that value. You like cache misses?

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Changwoo Min 5 days ago

Hello,

On 24. 11. 17. 04:32, Peter Zijlstra wrote:
> It is not at all clear why you think you need to keep a second copy of
> that value. You like cache misses?

Of course not. :-) I always try to avoid cache misses whenever
possible.

The main reason to keep the second copy (rq->scx.clock) is that
a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
time in any context, including any of sched_ext operations, BPF
timer callbacks, BPF syscalls, kprobes, and so on.

However, the rq->clock is supposed to be updated while holding
the rq lock (lockdep_assert_rq_held), which is not always the
case in a BPF scheduler. Also, rq->clock is also used in other
places (e.g., PELT), so updating rq->clock in arbitrary points by
a BPF scheduler will create unnecessary dependencies.

Another approach would be to extend struct sched_clock_data (in
kernel/sched/clock.c) to store the update flag
(SCX_RQ_CLK_UPDATED). This would be the best regarding the number
of cache line accesses. However, that would be an overkill since
now sched_clock_data stores the sched_ext-specific data.
I thought it would be better to keep sched_ext specific data in
one place, struct scx_rq, for managibility.

Regards,
Changwoo Min

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Peter Zijlstra 4 days, 6 hours ago

On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:

> The main reason to keep the second copy (rq->scx.clock) is that
> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
> time in any context, including any of sched_ext operations, BPF
> timer callbacks, BPF syscalls, kprobes, and so on.

If it's going to be a BPF wide thing, why is it presented as part of
sched_ext ? That makes no sense.

> Another approach would be to extend struct sched_clock_data (in
> kernel/sched/clock.c) to store the update flag
> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
> of cache line accesses. However, that would be an overkill since
> now sched_clock_data stores the sched_ext-specific data.
> I thought it would be better to keep sched_ext specific data in
> one place, struct scx_rq, for managibility.

What's the purpose of that flag? Why can't BPF use sched_clock_local()
and call it a day?

Do note that kernel/sched/clock.c is very much x86 specific (it was
briefly used by ia64 since their 'TSC' was of equal quality).

Growing sched_clock_data shouldn't be a problem, it's only 24 bytes, so
we have plenty free bytes there.

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Changwoo Min 3 days, 14 hours ago

Hello,

Thank you for the prompt feedback. I hope the following answers
can clarify most of your doubts.

On 24. 11. 18. 18:41, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:
> 
>> The main reason to keep the second copy (rq->scx.clock) is that
>> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
>> time in any context, including any of sched_ext operations, BPF
>> timer callbacks, BPF syscalls, kprobes, and so on.
> 
> If it's going to be a BPF wide thing, why is it presented as part of
> sched_ext ? That makes no sense.

There is a confusion here. scx_bpf_clock_get_ns() is for BPF
schedulers, not for random BPF programs. In almost all cases, it
will be used in the shced_ext operations, such as ops.running()
and ops.stopping(), to implement scheduling policies. However, if
BPF schedulers use other BPF features, such as BPF timer,
scx_bpf_clock_get_ns() also can be used. For example, scx_lavd uses
a BPF timer for periodic background processing; scx_lavd and
scx_flash use kprobe to trace futex system calls. Also, since
scx_bpf_clock_get_ns() relies on rq lock, it is not meaningful
outside of the BPF schedulers. Hence, it should be a part of
sched_ext.

>> Another approach would be to extend struct sched_clock_data (in
>> kernel/sched/clock.c) to store the update flag
>> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
>> of cache line accesses. However, that would be an overkill since
>> now sched_clock_data stores the sched_ext-specific data.
>> I thought it would be better to keep sched_ext specific data in
>> one place, struct scx_rq, for managibility.
> 
> What's the purpose of that flag? Why can't BPF use sched_clock_local()
> and call it a day?

Let's suppose the following timeline:

   T1. rq_lock(rq)
   T2. update_rq_clock(rq)
   T3. a sched_ext BPF operation
   T4. rq_unlock(rq)
   T5. a sched_ext BPF operation
   T6. rq_lock(rq)
   T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid
(SCX_RQ_CLK_UPDATED is set), so scx_bpf_clock_get_ns calls during
[T2, T4) (including T3) will return the rq clock updated at T2.
Let's think about what we should do for the duration [T4, T7)
when a BPF scheduler can still call scx_bpf_clock_get_ns (T5).
During that duration, we consider the rq clock is invalid
(SCX_RQ_CLK_UPDATED is unset). So when calling
scx_bpf_clock_get_ns at T5, we call sched_clock() to get the
fresh clock.

I think the term `UPDATED` was misleading. I will change it to
`VALID` in the next version.

> Growing sched_clock_data shouldn't be a problem, it's only 24 bytes, so
> we have plenty free bytes there.

Alright. I will change the current implementation and extend
`struct sched_clock_data` to store the `VALID` flag in the next
version.

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Peter Zijlstra 3 days, 7 hours ago

On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:

> > What's the purpose of that flag? Why can't BPF use sched_clock_local()
> > and call it a day?
> 
> Let's suppose the following timeline:
> 
>   T1. rq_lock(rq)
>   T2. update_rq_clock(rq)
>   T3. a sched_ext BPF operation
>   T4. rq_unlock(rq)
>   T5. a sched_ext BPF operation
>   T6. rq_lock(rq)
>   T7. update_rq_clock(rq)
> 
> For [T2, T4), we consider that rq clock is valid
> (SCX_RQ_CLK_UPDATED is set), so scx_bpf_clock_get_ns calls during
> [T2, T4) (including T3) will return the rq clock updated at T2.
> Let's think about what we should do for the duration [T4, T7)
> when a BPF scheduler can still call scx_bpf_clock_get_ns (T5).
> During that duration, we consider the rq clock is invalid
> (SCX_RQ_CLK_UPDATED is unset). So when calling
> scx_bpf_clock_get_ns at T5, we call sched_clock() to get the
> fresh clock.
> 
> I think the term `UPDATED` was misleading. I will change it to
> `VALID` in the next version.

So the reason rq->clock is tied to rq->lock, is to ensure a scheduling
operation happens at a single point in time.

Suppose re-nice, you dequeue the task, you modify its properties
(weight) and then you requeue it. If time were passing 'normally' the
task would loose the time between dequeue and enqueue -- this is not
right.

The only obvious exception here is a migration.

So the question then becomes, what is T5 doing and is it 'right' for it
to get a fresh clock value.

Please give an example of T5 -- I really don't know this BPF crap much
-- and reason about how the clock should behave.

Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Posted by Changwoo Min 2 days, 23 hours ago

Hello,

On 24. 11. 19. 17:17, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:

>> Let's suppose the following timeline:
>>
>>    T1. rq_lock(rq)
>>    T2. update_rq_clock(rq)
>>    T3. a sched_ext BPF operation
>>    T4. rq_unlock(rq)
>>    T5. a sched_ext BPF operation
>>    T6. rq_lock(rq)
>>    T7. update_rq_clock(rq)
>>
>> For [T2, T4), we consider that rq clock is valid
>> (SCX_RQ_CLK_UPDATED is set), so scx_bpf_clock_get_ns calls during
>> [T2, T4) (including T3) will return the rq clock updated at T2.
>> Let's think about what we should do for the duration [T4, T7)
>> when a BPF scheduler can still call scx_bpf_clock_get_ns (T5).
>> During that duration, we consider the rq clock is invalid
>> (SCX_RQ_CLK_UPDATED is unset). So when calling
>> scx_bpf_clock_get_ns at T5, we call sched_clock() to get the
>> fresh clock.

> So the question then becomes, what is T5 doing and is it 'right' for it
> to get a fresh clock value.
> 
> Please give an example of T5 -- I really don't know this BPF crap much
> -- and reason about how the clock should behave.

Here is one example. `scx_central` uses a BPF timer for
preemptive scheduling. In every msec, the timer callback checks
if the currently running tasks exceed their timeslice. At the
beginning of the BPF timer callback (central_timerfn in
scx_central.bpf.c), scx_central gets the current time. When the
BPF timer callback runs, the rq clock could be invalid, the same
as T5. In this case, it is reasonable to return a fresh clock
value rather than returning the old one (T2).

Besides this timer example, scx_bpf_clock_get_ns() can be called
any callbacks defined in `struct sched_ext_ops`. Some callbacks
can be called without holding a rq lock (e.g., ops.cpu_online,
ops.cgroup_init). In these cases, it is reasonable to reutrn a
fresh clock value rather returning the old one.

Regards,
Changwoo Min