Returns a high-performance monotonically non-decreasing clock for the
current CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns()
frequently to account for execution time and track tasks' runtime
properties (see the usage sketch after this list). Unfortunately, on
some hardware platforms, bpf_ktime_get_ns() -- which eventually reads a
hardware timestamp counter -- is neither performant nor scalable.
scx_bpf_clock_get_ns() aims to provide a high-performance clock by using
the rq clock in the scheduler core whenever possible.

2) High enough resolution for BPF scheduler use cases: In most BPF
scheduler use cases, the required clock resolution is lower than that of
the most accurate hardware clock (e.g., rdtsc on x86).
scx_bpf_clock_get_ns() uses the rq clock in the scheduler core whenever
it is valid. The rq clock is considered valid from the time it is
updated (update_rq_clock) until the rq is unlocked (rq_unpin_lock).
In addition, the rq clock is invalidated after long operations in the
BPF scheduler -- ops.running() and ops.update_idle().

3) Monotonically non-decreasing clock on the same CPU:
scx_bpf_clock_get_ns() guarantees that the clock never goes backward
when comparing values read on the same CPU. There is no such guarantee
across CPUs -- a clock read on one CPU can appear to go backward
relative to a read on another. The clock is monotonically
*non-decreasing* rather than strictly increasing: two
scx_bpf_clock_get_ns() calls on the same CPU within the same rq clock
validity period may return the same value.
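
For illustration, here is a rough sketch of how a BPF scheduler might
use this kfunc on the ops.running()/ops.stopping() path to account
per-task runtime. The struct, map, and callback names below are made up
for the example, and it assumes the usual sched_ext BPF boilerplate
(e.g., #include <scx/common.bpf.h>):

struct task_ctx {
        u64     running_at;
        u64     runtime;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct task_ctx);
} task_ctx_stor SEC(".maps");

void BPF_STRUCT_OPS(example_running, struct task_struct *p)
{
        struct task_ctx *tctx;

        tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
        if (!tctx)
                return;
        /* cheap per-CPU read instead of bpf_ktime_get_ns() */
        tctx->running_at = scx_bpf_clock_get_ns();
}

void BPF_STRUCT_OPS(example_stopping, struct task_struct *p, bool runnable)
{
        struct task_ctx *tctx;

        tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
        if (!tctx)
                return;
        /*
         * Both reads happen on the CPU the task runs on, so the
         * same-CPU non-decreasing guarantee keeps the delta >= 0.
         */
        tctx->runtime += scx_bpf_clock_get_ns() - tctx->running_at;
}
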
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
kernel/sched/ext.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b8ad776ef516..b0374274ead2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7541,6 +7541,76 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
}
#endif
+/**
+ * scx_bpf_clock_get_ns - Returns a high-performance monotonically
+ * non-decreasing clock for the current CPU. The clock returned is in
+ * nanoseconds.
+ *
+ * It provides the following properties:
+ *
+ * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
+ * to account for execution time and track tasks' runtime properties.
+ * Unfortunately, on some hardware platforms, bpf_ktime_get_ns() -- which
+ * eventually reads a hardware timestamp counter -- is neither performant nor
+ * scalable. scx_bpf_clock_get_ns() aims to provide a high-performance clock
+ * by using the rq clock in the scheduler core whenever possible.
+ *
+ * 2) High enough resolution for BPF scheduler use cases: In most BPF
+ * scheduler use cases, the required clock resolution is lower than that
+ * of the most accurate hardware clock (e.g., rdtsc on x86).
+ * scx_bpf_clock_get_ns() uses the rq clock in the scheduler core
+ * whenever it is valid. The rq clock is considered valid from the time
+ * it is updated (update_rq_clock) until the rq is unlocked
+ * (rq_unpin_lock). In addition, the rq clock is invalidated after long
+ * operations in the BPF scheduler -- ops.running() and ops.update_idle().
+ *
+ * 3) Monotonically non-decreasing clock on the same CPU:
+ * scx_bpf_clock_get_ns() guarantees that the clock never goes backward
+ * when comparing values read on the same CPU. There is no such guarantee
+ * across CPUs -- a clock read on one CPU can appear to go backward
+ * relative to a read on another. The clock is monotonically
+ * *non-decreasing* rather than strictly increasing: two calls on the same
+ * CPU within the same rq clock validity period may return the same value.
+ */
+__bpf_kfunc u64 scx_bpf_clock_get_ns(void)
+{
+ static DEFINE_PER_CPU(u64, prev_clk);
+ struct rq *rq = this_rq();
+ u64 pr_clk, cr_clk;
+
+ preempt_disable();
+ pr_clk = __this_cpu_read(prev_clk);
+
+ /*
+ * If the rq clock is invalid, start a new rq clock period
+ * with a fresh sched_clock().
+ */
+ if (!(rq->scx.flags & SCX_RQ_CLK_UPDATED)) {
+ cr_clk = sched_clock();
+ scx_rq_clock_update(rq, cr_clk);
+ }
+ /*
+ * If the rq clock is valid, use the cached rq clock
+ * whenever the clock does not go backward.
+ */
+ else {
+ cr_clk = rq->scx.clock;
+ /*
+ * If the clock goes backward, start a new rq clock period
+ * with a fresh sched_clock().
+ */
+ if (pr_clk > cr_clk) {
+ cr_clk = sched_clock();
+ scx_rq_clock_update(rq, cr_clk);
+ }
+ }
+
+ __this_cpu_write(prev_clk, cr_clk);
+ preempt_enable();
+
+ return cr_clk;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7572,6 +7642,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
#ifdef CONFIG_CGROUP_SCHED
BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
#endif
+BTF_ID_FLAGS(func, scx_bpf_clock_get_ns)
BTF_KFUNCS_END(scx_kfunc_ids_any)
static const struct btf_kfunc_id_set scx_kfunc_set_any = {
--
2.47.0
On Sun, Nov 17, 2024 at 01:01:24AM +0900, Changwoo Min wrote:
> Returns a high-performance monotonically non-decreasing clock for the
> current CPU. The clock returned is in nanoseconds.
[...]

Have you seen the insides of kernel/sched/clock.c ?
Hello,

On 24. 11. 17. 04:31, Peter Zijlstra wrote:
> Have you seen the insides of kernel/sched/clock.c ?

Of course. :-) It would be super helpful if you could let me know
specific questions or comments. I didn't extend (or piggyback on)
clock.c because, as I explained in the other email, I think it is
overkill, creating dependencies between clock.c and the sched_ext code.

Regards,
Changwoo Min
On Mon, Nov 18, 2024 at 12:48:35AM +0900, Changwoo Min wrote:
> Hello,
>
> On 24. 11. 17. 04:31, Peter Zijlstra wrote:
> > Have you seen the insides of kernel/sched/clock.c ?
>
> Of course. :-) It would be super helpful if you could let me know
> specific questions or comments.

Well, mostly I don't understand anything about what you're doing.
Perhaps if you explain what's wrong with the bits we have ?

If you're looking at the scheduler, then rq->clock really should be
sufficient. If you're looking at rando BPF crud, then what's wrong with
local_clock()?

Also, why are we still caring about systems that have crazy TSC?