An rq clock becomes valid when it is updated using update_rq_clock()
and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
after long-running operations -- ops.running() and ops.update_idle() --
in a BPF scheduler, the sched_ext core invalidates the rq clock.
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/core.c  | 6 +++++-
 kernel/sched/ext.c   | 3 +++
 kernel/sched/sched.h | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a910a5b4c274..d0eb58b6a2da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
void update_rq_clock(struct rq *rq)
{
s64 delta;
+ u64 clock;
lockdep_assert_rq_held(rq);
@@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
rq->clock_update_flags |= RQCF_UPDATED;
#endif
+ clock = sched_clock_cpu(cpu_of(rq));
+ scx_rq_clock_update(rq, clock);
- delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+ delta = clock - rq->clock;
if (delta < 0)
return;
rq->clock += delta;
+
update_rq_clock_task(rq, delta);
}
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9f9bc2930658..b8ad776ef516 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2928,6 +2928,8 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
+ scx_rq_clock_stale(rq);
+
clr_task_runnable(p, true);
/*
@@ -3590,6 +3592,7 @@ void __scx_update_idle(struct rq *rq, bool idle)
{
int cpu = cpu_of(rq);
+ scx_rq_clock_stale(rq);
if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
if (!static_branch_unlikely(&scx_builtin_idle_enabled))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 61efff790e24..03854ac9914b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1758,7 +1758,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
if (rq->clock_update_flags > RQCF_ACT_SKIP)
rf->clock_update_flags = RQCF_UPDATED;
#endif
-
+ scx_rq_clock_stale(rq);
lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
}
--
2.47.0
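Two helpers called above, scx_rq_clock_update() and scx_rq_clock_stale(),
are introduced elsewhere in this series and are not shown in this patch.
A minimal sketch of the idea -- assuming struct scx_rq carries a clock
copy plus a flags word with an SCX_RQ_CLK_UPDATED bit, per the discussion
below; the actual definitions may differ:

static inline void scx_rq_clock_update(struct rq *rq, u64 clock)
{
        if (!scx_enabled())
                return;
        /*
         * Callers hold the rq lock; WRITE_ONCE() pairs with the
         * lockless reader in the scx_bpf_clock_get_ns() kfunc.
         */
        WRITE_ONCE(rq->scx.clock, clock);
        WRITE_ONCE(rq->scx.flags, rq->scx.flags | SCX_RQ_CLK_UPDATED);
}

static inline void scx_rq_clock_stale(struct rq *rq)
{
        if (!scx_enabled())
                return;
        /* Mark the copy invalid until the next update_rq_clock(). */
        WRITE_ONCE(rq->scx.flags, rq->scx.flags & ~SCX_RQ_CLK_UPDATED);
}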
On Sun, Nov 17, 2024 at 01:01:23AM +0900, Changwoo Min wrote:
> An rq clock becomes valid when it is updated using update_rq_clock()
> and invalidated when the rq is unlocked using rq_unpin_lock(). Also,
> after long-running operations -- ops.running() and ops.update_idle() --
> in a BPF scheduler, the sched_ext core invalidates the rq clock.
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>  kernel/sched/core.c  | 6 +++++-
>  kernel/sched/ext.c   | 3 +++
>  kernel/sched/sched.h | 2 +-
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a910a5b4c274..d0eb58b6a2da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -784,6 +784,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>  void update_rq_clock(struct rq *rq)
>  {
>  	s64 delta;
> +	u64 clock;
>
>  	lockdep_assert_rq_held(rq);
>
> @@ -795,11 +796,14 @@ void update_rq_clock(struct rq *rq)
>  	SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
>  	rq->clock_update_flags |= RQCF_UPDATED;
>  #endif
> +	clock = sched_clock_cpu(cpu_of(rq));
> +	scx_rq_clock_update(rq, clock);

It is not at all clear why you think you need to keep a second copy of
that value. You like cache misses?
Hello,

On 24. 11. 17. 04:32, Peter Zijlstra wrote:
> It is not at all clear why you think you need to keep a second copy of
> that value. You like cache misses?

Of course not. :-) I always try to avoid cache misses whenever possible.

The main reason to keep the second copy (rq->scx.clock) is that a BPF
scheduler can call scx_bpf_clock_get_ns() at almost any time in any
context, including any of the sched_ext operations, BPF timer callbacks,
BPF syscalls, kprobes, and so on. However, rq->clock is supposed to be
updated while holding the rq lock (lockdep_assert_rq_held), which is not
always the case in a BPF scheduler. Also, rq->clock is used in other
places (e.g., PELT), so updating rq->clock at arbitrary points from a
BPF scheduler would create unnecessary dependencies.

Another approach would be to extend struct sched_clock_data (in
kernel/sched/clock.c) to store the update flag (SCX_RQ_CLK_UPDATED).
This would be the best regarding the number of cache line accesses.
However, that would be overkill since sched_clock_data would then store
sched_ext-specific data. I thought it would be better to keep sched_ext
specific data in one place, struct scx_rq, for manageability.

Regards,
Changwoo Min
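To make the trade-off concrete, the kfunc under discussion could look
roughly like the following -- a sketch under the same assumptions
(rq->scx.clock plus an SCX_RQ_CLK_UPDATED flag), not the posted
implementation:

__bpf_kfunc u64 scx_bpf_clock_get_ns(void)
{
        struct rq *rq = this_rq();

        /*
         * If update_rq_clock() ran under the rq lock and nothing has
         * invalidated the copy since, reuse the cached rq clock so
         * every caller in that critical section sees one time.
         */
        if (READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_UPDATED)
                return READ_ONCE(rq->scx.clock);

        /* Otherwise fall back to a fresh per-CPU clock reading. */
        return sched_clock_cpu(cpu_of(rq));
}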
On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:

> The main reason to keep the second copy (rq->scx.clock) is that
> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
> time in any context, including any of the sched_ext operations,
> BPF timer callbacks, BPF syscalls, kprobes, and so on.

If it's going to be a BPF-wide thing, why is it presented as part of
sched_ext? That makes no sense.

> Another approach would be to extend struct sched_clock_data (in
> kernel/sched/clock.c) to store the update flag
> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
> of cache line accesses. However, that would be overkill since
> sched_clock_data would then store sched_ext-specific data.
> I thought it would be better to keep sched_ext specific data in
> one place, struct scx_rq, for manageability.

What's the purpose of that flag? Why can't BPF use sched_clock_local()
and call it a day?

Do note that kernel/sched/clock.c is very much x86 specific (it was
briefly used by ia64 since their 'TSC' was of equal quality).

Growing sched_clock_data shouldn't be a problem, it's only 24 bytes, so
we have plenty of free bytes there.
Hello,

Thank you for the prompt feedback. I hope the following answers can
clarify most of your doubts.

On 24. 11. 18. 18:41, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:
>
>> The main reason to keep the second copy (rq->scx.clock) is that
>> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
>> time in any context, including any of the sched_ext operations,
>> BPF timer callbacks, BPF syscalls, kprobes, and so on.
>
> If it's going to be a BPF-wide thing, why is it presented as part of
> sched_ext? That makes no sense.

There is a confusion here. scx_bpf_clock_get_ns() is for BPF
schedulers, not for random BPF programs. In almost all cases, it will
be used in the sched_ext operations, such as ops.running() and
ops.stopping(), to implement scheduling policies. However, when a BPF
scheduler uses other BPF features, such as a BPF timer,
scx_bpf_clock_get_ns() can also be used there. For example, scx_lavd
uses a BPF timer for periodic background processing; scx_lavd and
scx_flash use kprobes to trace futex system calls. Also, since
scx_bpf_clock_get_ns() relies on the rq lock, it is not meaningful
outside of a BPF scheduler. Hence, it should be a part of sched_ext.

>> Another approach would be to extend struct sched_clock_data (in
>> kernel/sched/clock.c) to store the update flag
>> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
>> of cache line accesses. However, that would be overkill since
>> sched_clock_data would then store sched_ext-specific data.
>> I thought it would be better to keep sched_ext specific data in
>> one place, struct scx_rq, for manageability.
>
> What's the purpose of that flag? Why can't BPF use sched_clock_local()
> and call it a day?

Let's suppose the following timeline:

	T1. rq_lock(rq)
	T2. update_rq_clock(rq)
	T3. a sched_ext BPF operation
	T4. rq_unlock(rq)
	T5. a sched_ext BPF operation
	T6. rq_lock(rq)
	T7. update_rq_clock(rq)

For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_UPDATED is
set), so scx_bpf_clock_get_ns() calls during [T2, T4) (including T3)
return the rq clock updated at T2.

Now consider the duration [T4, T7), during which a BPF scheduler can
still call scx_bpf_clock_get_ns() (T5). During that duration, we
consider the rq clock invalid (SCX_RQ_CLK_UPDATED is unset), so a
scx_bpf_clock_get_ns() call at T5 falls back to sched_clock() to get
a fresh clock.

I think the term `UPDATED` was misleading. I will change it to `VALID`
in the next version.

> Growing sched_clock_data shouldn't be a problem, it's only 24 bytes, so
> we have plenty of free bytes there.

Alright. I will change the current implementation and extend
`struct sched_clock_data` to store the `VALID` flag in the next
version.
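For reference, the alternative agreed on here would amount to something
like this in kernel/sched/clock.c -- illustrative only; the new field
name is made up, and today's struct holds exactly the three u64 fields
behind Peter's "24 bytes":

struct sched_clock_data {
        u64                     tick_raw;
        u64                     tick_gtod;
        u64                     clock;
        /*
         * Hypothetical VALID flag discussed above: set by
         * update_rq_clock(), cleared when the rq is unpinned.
         */
        u64                     clk_valid;
};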
On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:

> > What's the purpose of that flag? Why can't BPF use sched_clock_local()
> > and call it a day?
>
> Let's suppose the following timeline:
>
> 	T1. rq_lock(rq)
> 	T2. update_rq_clock(rq)
> 	T3. a sched_ext BPF operation
> 	T4. rq_unlock(rq)
> 	T5. a sched_ext BPF operation
> 	T6. rq_lock(rq)
> 	T7. update_rq_clock(rq)
>
> For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_UPDATED is
> set), so scx_bpf_clock_get_ns() calls during [T2, T4) (including T3)
> return the rq clock updated at T2.
>
> Now consider the duration [T4, T7), during which a BPF scheduler can
> still call scx_bpf_clock_get_ns() (T5). During that duration, we
> consider the rq clock invalid (SCX_RQ_CLK_UPDATED is unset), so a
> scx_bpf_clock_get_ns() call at T5 falls back to sched_clock() to get
> a fresh clock.
>
> I think the term `UPDATED` was misleading. I will change it to `VALID`
> in the next version.

So the reason rq->clock is tied to rq->lock is to ensure a scheduling
operation happens at a single point in time. Suppose a re-nice: you
dequeue the task, you modify its properties (weight) and then you
requeue it. If time were passing 'normally' the task would lose the
time between dequeue and enqueue -- this is not right. The only obvious
exception here is a migration.

So the question then becomes: what is T5 doing, and is it 'right' for
it to get a fresh clock value?

Please give an example of T5 -- I really don't know this BPF crap much
-- and reason about how the clock should behave.
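Condensed from set_user_nice() in kernel/sched/core.c (range checks and
the running-task handling elided), the re-nice sequence Peter describes
looks roughly like this -- dequeue, weight change, and enqueue all run
under one rq lock, against the single reading taken by
update_rq_clock():

void set_user_nice(struct task_struct *p, long nice)
{
        bool queued;
        struct rq_flags rf;
        struct rq *rq;

        rq = task_rq_lock(p, &rf);
        update_rq_clock(rq);    /* one clock reading for the whole op */

        queued = task_on_rq_queued(p);
        if (queued)
                dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);

        p->static_prio = NICE_TO_PRIO(nice);    /* modify the weight */
        set_load_weight(p, true);

        if (queued)
                enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);

        task_rq_unlock(rq, p, &rf);
}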
Hello,

On 24. 11. 19. 17:17, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 10:19:44AM +0900, Changwoo Min wrote:
>> Let's suppose the following timeline:
>>
>> 	T1. rq_lock(rq)
>> 	T2. update_rq_clock(rq)
>> 	T3. a sched_ext BPF operation
>> 	T4. rq_unlock(rq)
>> 	T5. a sched_ext BPF operation
>> 	T6. rq_lock(rq)
>> 	T7. update_rq_clock(rq)
>>
>> For [T2, T4), we consider the rq clock valid (SCX_RQ_CLK_UPDATED is
>> set), so scx_bpf_clock_get_ns() calls during [T2, T4) (including T3)
>> return the rq clock updated at T2.
>>
>> Now consider the duration [T4, T7), during which a BPF scheduler can
>> still call scx_bpf_clock_get_ns() (T5). During that duration, we
>> consider the rq clock invalid (SCX_RQ_CLK_UPDATED is unset), so a
>> scx_bpf_clock_get_ns() call at T5 falls back to sched_clock() to get
>> a fresh clock.
>
> So the question then becomes: what is T5 doing, and is it 'right' for
> it to get a fresh clock value?
>
> Please give an example of T5 -- I really don't know this BPF crap much
> -- and reason about how the clock should behave.

Here is one example. `scx_central` uses a BPF timer for preemptive
scheduling. Every msec, the timer callback checks whether the currently
running tasks have exceeded their timeslice. At the beginning of the
BPF timer callback (central_timerfn in scx_central.bpf.c), scx_central
gets the current time. When the BPF timer callback runs, the rq clock
could be invalid, the same as at T5. In this case, it is reasonable to
return a fresh clock value rather than the old one (T2).

Besides this timer example, scx_bpf_clock_get_ns() can be called from
any callback defined in `struct sched_ext_ops`. Some callbacks can be
called without holding an rq lock (e.g., ops.cpu_online,
ops.cgroup_init). In these cases, too, it is reasonable to return a
fresh clock value rather than the old one.

Regards,
Changwoo Min
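A sketch of that T5 shape, in the spirit of scx_central's
central_timerfn() -- the globals (MAX_CPUS, slice_ns, cpu_started_at)
and the bookkeeping are made up for illustration, not taken from the
actual scx_central source:

#define MAX_CPUS 512                    /* illustrative bound */

const volatile u32 nr_cpu_ids;          /* set by userspace at load */
const volatile u64 slice_ns;            /* illustrative timeslice */
u64 cpu_started_at[MAX_CPUS];           /* illustrative start times */

static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
{
        /*
         * The rq clock is most likely stale here (the T5 case), so
         * scx_bpf_clock_get_ns() falls back to a fresh clock value.
         */
        u64 now = scx_bpf_clock_get_ns();
        s32 cpu;

        bpf_for(cpu, 0, nr_cpu_ids) {
                if (cpu >= MAX_CPUS)
                        break;
                /* Preempt CPUs whose task has outrun its timeslice. */
                if (cpu_started_at[cpu] &&
                    now - cpu_started_at[cpu] >= slice_ns)
                        scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
        }

        /* Re-arm for the next 1 msec tick. */
        bpf_timer_start(timer, 1000000, 0);
        return 0;
}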