An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
after long-running operations -- ops.running() and ops.update_idle() --
in a BPF scheduler, the sched_ext core invalidates the rq clock.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/core.c | 6 +++++-
kernel/sched/ext.c | 3 +++
kernel/sched/sched.h | 2 +-
3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a910a5b4c274..d0eb58b6a2da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
+	u64 clock;
 
 	lockdep_assert_rq_held(rq);
 
@@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
 	SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
 	rq->clock_update_flags |= RQCF_UPDATED;
 #endif
+	clock = sched_clock_cpu(cpu_of(rq));
+	scx_rq_clock_update(rq, clock);
 
-	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+	delta = clock - rq->clock;
 	if (delta < 0)
 		return;
 	rq->clock += delta;
+
 	update_rq_clock_task(rq, delta);
 }
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9f9bc2930658..b8ad776ef516 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2928,6 +2928,8 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 	if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
 		SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
 
+	scx_rq_clock_stale(rq);
+
 	clr_task_runnable(p, true);
 
 	/*
@@ -3590,6 +3592,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
 {
 	int cpu = cpu_of(rq);
 
+	scx_rq_clock_stale(rq);
 	if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
 		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
 		if (!static_branch_unlikely(&scx_builtin_idle_enabled))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 61efff790e24..03854ac9914b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1758,7 +1758,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 	if (rq->clock_update_flags > RQCF_ACT_SKIP)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
-
+	scx_rq_clock_stale(rq);
 	lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
 }
 
--
2.47.0
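
The scx_rq_clock_update() and scx_rq_clock_stale() helpers used by this
patch are introduced elsewhere in the series. The following is only a
minimal sketch of plausible implementations, with illustrative field and
flag names (rq->scx.clock, rq->scx.flags, SCX_RQ_CLK_UPDATED), not the
actual definitions:

        /*
         * Minimal sketch of the helpers referenced by the patch above;
         * the real definitions live elsewhere in the series. Field and
         * flag names here are illustrative.
         */
        static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
        {
                if (!scx_enabled())
                        return;
                /* Cache the freshly read clock and mark it valid. */
                WRITE_ONCE(rq->scx.clock, clock);
                WRITE_ONCE(rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_UPDATED);
        }

        static inline void scx_rq_clock_stale(struct rq *rq)
        {
                if (!scx_enabled())
                        return;
                /* Invalidate the cached clock, e.g. on rq unlock. */
                WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_UPDATED);
        }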
On Sun, Nov 17, 2024 at 01:01:23AM +0900, Changwoo Min wrote:
> An rq clock becomes valid when it is updated using update_rq_clock()
> and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
> after long-running operations -- ops.running() and ops.update_idle() --
> in a BPF scheduler, the sched_ext core invalidates the rq clock.
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
> kernel/sched/core.c | 6 +++++-
> kernel/sched/ext.c | 3 +++
> kernel/sched/sched.h | 2 +-
> 3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a910a5b4c274..d0eb58b6a2da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  void update_rq_clock(struct rq *rq)
>  {
>  	s64 delta;
> +	u64 clock;
>  
>  	lockdep_assert_rq_held(rq);
>  
> @@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
>  	SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
>  	rq->clock_update_flags |= RQCF_UPDATED;
>  #endif
> +	clock = sched_clock_cpu(cpu_of(rq));
> +	scx_rq_clock_update(rq, clock);
It is not at all clear why you think you need to keep a second copy of
that value. You like cache misses?
Hello,

On 24. 11. 17. 04:32, Peter Zijlstra wrote:
> It is not at all clear why you think you need to keep a second copy of
> that value. You like cache misses?

Of course not. :-) I always try to avoid cache misses whenever possible.
The main reason to keep the second copy (rq->scx.clock) is that a BPF
scheduler can call scx_bpf_clock_get_ns() at almost any time in any
context, including any of the sched_ext operations, BPF timer callbacks,
BPF syscalls, kprobes, and so on. However, rq->clock is supposed to be
updated while holding the rq lock (lockdep_assert_rq_held), which is not
always the case in a BPF scheduler. Also, rq->clock is used in other
places (e.g., PELT), so updating rq->clock at arbitrary points from a
BPF scheduler would create unnecessary dependencies.

Another approach would be to extend struct sched_clock_data (in
kernel/sched/clock.c) to store the update flag (SCX_RQ_CLK_UPDATED).
This would be the best regarding the number of cache line accesses.
However, that would be overkill, since sched_clock_data would then store
sched_ext-specific data. I thought it would be better to keep
sched_ext-specific data in one place, struct scx_rq, for manageability.

Regards,
Changwoo Min
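
As a rough sketch of the read path Changwoo describes -- not the posted
implementation; the rq->scx.clock field and the SCX_RQ_CLK_UPDATED flag
follow the naming used in this thread:

        /*
         * Rough sketch only: return the cached rq clock while it is
         * known valid, otherwise fall back to a fresh clock read.
         */
        static u64 scx_bpf_clock_get_ns_sketch(void)
        {
                struct rq *rq = this_rq();

                /*
                 * The clock was refreshed under the rq lock and has
                 * not been invalidated since: reuse the cached copy.
                 */
                if (READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_UPDATED)
                        return READ_ONCE(rq->scx.clock);

                /* Otherwise read a fresh clock. */
                return sched_clock_cpu(cpu_of(rq));
        }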
On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:

> The main reason to keep the second copy (rq->scx.clock) is that
> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
> time in any context, including any of the sched_ext operations,
> BPF timer callbacks, BPF syscalls, kprobes, and so on.

If it's going to be a BPF-wide thing, why is it presented as part of
sched_ext? That makes no sense.

> Another approach would be to extend struct sched_clock_data (in
> kernel/sched/clock.c) to store the update flag
> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
> of cache line accesses. However, that would be overkill, since
> sched_clock_data would then store sched_ext-specific data.
> I thought it would be better to keep sched_ext-specific data in
> one place, struct scx_rq, for manageability.

What's the purpose of that flag? Why can't BPF use sched_clock_local()
and call it a day?

Do note that kernel/sched/clock.c is very much x86-specific (it was
briefly used by ia64 since their 'TSC' was of equal quality).

Growing sched_clock_data shouldn't be a problem; it's only 24 bytes, so
we have plenty of free bytes there.
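
For reference, struct sched_clock_data as defined in kernel/sched/clock.c
(only under CONFIG_HAVE_UNSTABLE_SCHED_CLOCK, which becomes relevant
later in this thread) is indeed three u64s, i.e. 24 bytes:

        /* kernel/sched/clock.c, CONFIG_HAVE_UNSTABLE_SCHED_CLOCK only */
        struct sched_clock_data {
                u64     tick_raw;       /* raw sched_clock() at last tick */
                u64     tick_gtod;      /* GTOD time at last tick */
                u64     clock;          /* last computed stable clock */
        };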
Hello,

Thank you for the prompt feedback. I hope the following answers can
clarify most of your doubts.

On 24. 11. 18. 18:41, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:
>
>> The main reason to keep the second copy (rq->scx.clock) is that
>> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
>> time in any context, including any of the sched_ext operations,
>> BPF timer callbacks, BPF syscalls, kprobes, and so on.
>
> If it's going to be a BPF-wide thing, why is it presented as part of
> sched_ext? That makes no sense.

There is a confusion here. scx_bpf_clock_get_ns() is for BPF schedulers,
not for random BPF programs. In almost all cases, it will be used in the
sched_ext operations, such as ops.running() and ops.stopping(), to
implement scheduling policies. However, if BPF schedulers use other BPF
features, such as a BPF timer, scx_bpf_clock_get_ns() can also be used
there. For example, scx_lavd uses a BPF timer for periodic background
processing; scx_lavd and scx_flash use kprobes to trace futex system
calls. Also, since scx_bpf_clock_get_ns() relies on the rq lock, it is
not meaningful outside of a BPF scheduler. Hence, it should be a part of
sched_ext.

>> Another approach would be to extend struct sched_clock_data (in
>> kernel/sched/clock.c) to store the update flag
>> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
>> of cache line accesses. However, that would be overkill, since
>> sched_clock_data would then store sched_ext-specific data.
>> I thought it would be better to keep sched_ext-specific data in
>> one place, struct scx_rq, for manageability.
>
> What's the purpose of that flag? Why can't BPF use sched_clock_local()
> and call it a day?

Let's suppose the following timeline:

    T1. rq_lock(rq)
    T2. update_rq_clock(rq)
    T3. a sched_ext BPF operation
    T4. rq_unlock(rq)
    T5. a sched_ext BPF operation
    T6. rq_lock(rq)
    T7. update_rq_clock(rq)

For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_UPDATED is
set), so scx_bpf_clock_get_ns() calls during [T2, T4) (including T3)
will return the rq clock updated at T2.

Now let's think about what we should do for the duration [T4, T7), when
a BPF scheduler can still call scx_bpf_clock_get_ns() (T5). During that
duration, we consider the rq clock invalid (SCX_RQ_CLK_UPDATED is
unset), so when scx_bpf_clock_get_ns() is called at T5, we call
sched_clock() to get a fresh clock.

I think the term `UPDATED` was misleading. I will change it to `VALID`
in the next version.

> Growing sched_clock_data shouldn't be a problem; it's only 24 bytes,
> so we have plenty of free bytes there.

Alright. I will change the current implementation and extend `struct
sched_clock_data` to store the `VALID` flag in the next version.
On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:

>> What's the purpose of that flag? Why can't BPF use sched_clock_local()
>> and call it a day?
>
> Let's suppose the following timeline:
>
>     T1. rq_lock(rq)
>     T2. update_rq_clock(rq)
>     T3. a sched_ext BPF operation
>     T4. rq_unlock(rq)
>     T5. a sched_ext BPF operation
>     T6. rq_lock(rq)
>     T7. update_rq_clock(rq)
>
> For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_UPDATED is
> set), so scx_bpf_clock_get_ns() calls during [T2, T4) (including T3)
> will return the rq clock updated at T2. Now let's think about what we
> should do for the duration [T4, T7), when a BPF scheduler can still
> call scx_bpf_clock_get_ns() (T5). During that duration, we consider
> the rq clock invalid (SCX_RQ_CLK_UPDATED is unset), so when
> scx_bpf_clock_get_ns() is called at T5, we call sched_clock() to get
> a fresh clock.
>
> I think the term `UPDATED` was misleading. I will change it to `VALID`
> in the next version.

So the reason rq->clock is tied to rq->lock is to ensure a scheduling
operation happens at a single point in time. Suppose a re-nice: you
dequeue the task, you modify its properties (weight), and then you
requeue it. If time were passing 'normally', the task would lose the
time between dequeue and enqueue -- this is not right. The only obvious
exception here is a migration.

So the question then becomes: what is T5 doing, and is it 'right' for
it to get a fresh clock value?

Please give an example of T5 -- I really don't know this BPF crap much
-- and reason about how the clock should behave.
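
A sketch of the pattern Peter describes, loosely modeled on
set_user_nice() in kernel/sched/core.c (this is an illustration of the
single-point-in-time rule, not a copy of that function): the clock is
updated once under the rq lock, so the dequeue, the property change,
and the enqueue are all accounted at the same instant.

        /* Sketch: re-nice under one rq clock reading. */
        static void renice_sketch(struct task_struct *p, long nice)
        {
                struct rq_flags rf;
                struct rq *rq = task_rq_lock(p, &rf);
                bool queued = task_on_rq_queued(p);

                update_rq_clock(rq);    /* the single point in time */

                if (queued)
                        dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);

                /* Modify the task's weight between dequeue and enqueue. */
                p->static_prio = NICE_TO_PRIO(nice);
                set_load_weight(p, true);

                if (queued)
                        enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);

                task_rq_unlock(rq, p, &rf);
        }

The NOCLOCK flags make both queue operations reuse the clock read by
update_rq_clock(), so no time passes for the task between the two.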
Hello,

On 24. 11. 19. 17:17, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:
>
>> Let's suppose the following timeline:
>>
>>     T1. rq_lock(rq)
>>     T2. update_rq_clock(rq)
>>     T3. a sched_ext BPF operation
>>     T4. rq_unlock(rq)
>>     T5. a sched_ext BPF operation
>>     T6. rq_lock(rq)
>>     T7. update_rq_clock(rq)
>
> So the question then becomes: what is T5 doing, and is it 'right' for
> it to get a fresh clock value?
>
> Please give an example of T5 -- I really don't know this BPF crap much
> -- and reason about how the clock should behave.

Here is one example. `scx_central` uses a BPF timer for preemptive
scheduling. Every msec, the timer callback checks whether the currently
running tasks have exceeded their timeslice. At the beginning of the BPF
timer callback (central_timerfn in scx_central.bpf.c), scx_central gets
the current time. When the BPF timer callback runs, the rq clock could
be invalid, the same as T5. In this case, it is reasonable to return a
fresh clock value rather than the old one (T2).

Besides this timer example, scx_bpf_clock_get_ns() can be called from
any callback defined in `struct sched_ext_ops`. Some callbacks can be
called without holding a rq lock (e.g., ops.cpu_online,
ops.cgroup_init). In these cases, it is also reasonable to return a
fresh clock value rather than the old one.

Regards,
Changwoo Min
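
A hedged sketch of such a T5-style timer callback, loosely following
the shape of central_timerfn() in scx_central.bpf.c -- the bookkeeping
array, the bound, and the interval constant here are hypothetical, and
the usual sched_ext BPF scheduler headers (scx/common.bpf.h) are
assumed:

        #define MAX_CPUS 512                    /* hypothetical bound */

        const volatile u32 nr_cpus;             /* set from userspace */
        const volatile u64 slice_ns;            /* per-task timeslice */
        const volatile u64 timer_interval_ns = 1000000ULL; /* 1 msec */

        u64 cpu_started_at[MAX_CPUS];           /* hypothetical bookkeeping */

        static int central_timerfn_sketch(void *map, int *key,
                                          struct bpf_timer *timer)
        {
                /*
                 * The timer callback runs without holding any rq lock
                 * (the T5 case), so scx_bpf_clock_get_ns() returns a
                 * fresh clock here instead of a stale cached rq clock.
                 */
                u64 now = scx_bpf_clock_get_ns();
                s32 cpu;

                bpf_for(cpu, 0, nr_cpus) {
                        /* Preempt CPUs whose task exceeded its slice. */
                        if (cpu_started_at[cpu] &&
                            now - cpu_started_at[cpu] >= slice_ns)
                                scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
                }

                bpf_timer_start(timer, timer_interval_ns, 0);   /* re-arm */
                return 0;
        }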
Hello,

On 24. 11. 20. 00:57, Changwoo Min wrote:
> On 24. 11. 19. 17:17, Peter Zijlstra wrote:
>> So the question then becomes: what is T5 doing, and is it 'right' for
>> it to get a fresh clock value?
>>
>> Please give an example of T5 -- I really don't know this BPF crap much
>> -- and reason about how the clock should behave.
>
> Here is one example. `scx_central` uses a BPF timer for preemptive
> scheduling. Every msec, the timer callback checks whether the
> currently running tasks have exceeded their timeslice. At the
> beginning of the BPF timer callback (central_timerfn in
> scx_central.bpf.c), scx_central gets the current time. When the BPF
> timer callback runs, the rq clock could be invalid, the same as T5.
> In this case, it is reasonable to return a fresh clock value rather
> than the old one (T2).
>
> Besides this timer example, scx_bpf_clock_get_ns() can be called from
> any callback defined in `struct sched_ext_ops`. Some callbacks can be
> called without holding a rq lock (e.g., ops.cpu_online,
> ops.cgroup_init). In these cases, it is also reasonable to return a
> fresh clock value rather than the old one.

I wonder if the above example is sufficient for you. If you need more
examples or clarification, please let me know.

Regarding my following comment in the previous email:

On 24. 11. 19. 10:19, Changwoo Min wrote:
>> Growing sched_clock_data shouldn't be a problem; it's only 24 bytes,
>> so we have plenty of free bytes there.
>
> Alright. I will change the current implementation and extend `struct
> sched_clock_data` to store the `VALID` flag in the next version.

I found that `struct sched_clock_data` is defined only when
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, so I think extending `struct
sched_clock_data` is not a viable approach. Extending `struct scx_rq`
seems the best option after ruling out sched_clock_data. I will make
sure the cached clock value and the flag in scx_rq are in the same
cache line to minimize cache misses. What do you think?

Thanks!
Changwoo Min
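
A sketch of the layout proposed here -- keeping the cached clock and
its validity flag adjacent in struct scx_rq so that a
scx_bpf_clock_get_ns() call touches a single cache line; the field and
flag names are illustrative, not the final patch:

        /*
         * Illustrative sketch only. Placing the cached clock next to
         * the flags word keeps the validity check and the clock read
         * within one cache line on the fast path.
         */
        struct scx_rq {
                /* ... existing sched_ext per-rq state ... */

                u64     clock;  /* cached copy of the rq clock */
                u32     flags;  /* includes an SCX_RQ_CLK_VALID bit */
        };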