[PATCH 3/5] sched_ext: Implement scx_bpf_clock_get_ns()

Posted by Changwoo Min 5 days, 23 hours ago
Returns a high-performance monotonically non-decreasing clock for the
current CPU. The clock returned is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns()
 frequently to account for execution time and track tasks' runtime
 properties. Unfortunately, on some hardware platforms, bpf_ktime_get_ns()
 -- which eventually reads a hardware timestamp counter -- is neither
 performant nor scalable. scx_bpf_clock_get_ns() aims to provide a
 high-performance clock by using the rq clock in the scheduler core
 whenever possible.

2) High enough resolution for BPF scheduler use cases: In most BPF
 scheduler use cases, the required clock resolution is lower than that of
 the most accurate hardware clock (e.g., rdtsc on x86). scx_bpf_clock_get_ns()
 uses the rq clock in the scheduler core whenever it is valid. The rq clock
 is considered valid from the time it is updated (update_rq_clock) until
 the rq is unlocked (rq_unpin_lock). In addition, the rq clock is
 invalidated after long-running operations in the BPF scheduler --
 ops.running() and ops.update_idle().

3) Monotonically non-decreasing clock on the same CPU:
 scx_bpf_clock_get_ns() guarantees the clock never goes backward when
 comparing values read on the same CPU. On the other hand, there is no
 such guarantee when comparing values read on different CPUs -- across
 CPUs the clock can appear to go backward. The clock is monotonically
 *non-decreasing* rather than strictly increasing, so two
 scx_bpf_clock_get_ns() calls on the same CPU may return the same value
 during the same period in which the rq clock is valid.
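
For illustration, a BPF scheduler might consume the new kfunc roughly as in
the sketch below to account per-task runtime. Only the scx_bpf_clock_get_ns()
declaration corresponds to this patch; the task-storage map, the task_ctx
layout, and the callback names are hypothetical boilerplate in the style of
the sched_ext example schedulers and may differ in a real scheduler.

/* Hypothetical usage sketch -- not part of this patch. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <scx/common.bpf.h>	/* assumed helper header providing BPF_STRUCT_OPS */

char _license[] SEC("license") = "GPL";

u64 scx_bpf_clock_get_ns(void) __ksym;	/* kfunc added by this patch */

struct task_ctx {
	u64 running_at;		/* timestamp taken when the task starts running */
	u64 total_runtime;	/* accumulated runtime in ns */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctxs SEC(".maps");

void BPF_STRUCT_OPS(sketch_running, struct task_struct *p)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctxs, p, NULL,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (tctx) {
		/* Cheap read: reuses the cached rq clock while it is valid. */
		tctx->running_at = scx_bpf_clock_get_ns();
	}
}

void BPF_STRUCT_OPS(sketch_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctxs, p, NULL, 0);
	if (tctx) {
		/* Same CPU as the .running() read, so the delta is >= 0. */
		tctx->total_runtime += scx_bpf_clock_get_ns() - tctx->running_at;
	}
}

Wiring these callbacks into the scheduler's sched_ext_ops struct_ops map is
omitted. Because both reads happen on the CPU the task ran on, property 3)
above guarantees that the computed delta is never negative.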

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b8ad776ef516..b0374274ead2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7541,6 +7541,77 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
 }
 #endif
 
+/**
+ * scx_bpf_clock_get_ns - Returns a high-performance monotonically
+ * non-decreasing clock for the current CPU. The clock returned is in
+ * nanoseconds.
+ *
+ * It provides the following properties:
+ *
+ * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
+ *  to account for execution time and track tasks' runtime properties.
+ *  Unfortunately, on some hardware platforms, bpf_ktime_get_ns() -- which
+ *  eventually reads a hardware timestamp counter -- is neither performant nor
+ *  scalable. scx_bpf_clock_get_ns() aims to provide a high-performance clock
+ *  by using the rq clock in the scheduler core whenever possible.
+ *
+ * 2) High enough resolution for BPF scheduler use cases: In most BPF
+ *  scheduler use cases, the required clock resolution is lower than that of
+ *  the most accurate hardware clock (e.g., rdtsc on x86).
+ *  scx_bpf_clock_get_ns() uses the rq clock in the scheduler core whenever
+ *  it is valid. The rq clock is considered valid from the time it is updated
+ *  (update_rq_clock) until the rq is unlocked (rq_unpin_lock). In addition,
+ *  the rq clock is invalidated after long-running operations --
+ *  ops.running() and ops.update_idle().
+ *
+ * 3) Monotonically non-decreasing clock on the same CPU:
+ *  scx_bpf_clock_get_ns() guarantees the clock never goes backward when
+ *  comparing values read on the same CPU, while there is no such guarantee
+ *  across CPUs -- a value read on another CPU can appear to be behind. The
+ *  clock is monotonically *non-decreasing* rather than strictly increasing,
+ *  so two scx_bpf_clock_get_ns() calls on the same CPU may return the same
+ *  value during the same period in which the rq clock is valid.
+ */
+__bpf_kfunc u64 scx_bpf_clock_get_ns(void)
+{
+	static DEFINE_PER_CPU(u64, prev_clk);
+	struct rq *rq;
+	u64 pr_clk, cr_clk;
+
+	preempt_disable();
+	rq = this_rq();
+	pr_clk = __this_cpu_read(prev_clk);
+
+	/*
+	 * If the rq clock is invalid, start a new rq clock period
+	 * with a fresh sched_clock().
+	 */
+	if (!(rq->scx.flags & SCX_RQ_CLK_UPDATED)) {
+		cr_clk = sched_clock();
+		scx_rq_clock_update(rq, cr_clk);
+	}
+	/*
+	 * If the rq clock is valid, use the cached rq clock
+	 * whenever the clock does not go backward.
+	 */
+	else {
+		cr_clk = rq->scx.clock;
+		/*
+		 * If the clock goes backward, start a new rq clock period
+		 * with a fresh sched_clock().
+		 */
+		if (pr_clk > cr_clk) {
+			cr_clk = sched_clock();
+			scx_rq_clock_update(rq, cr_clk);
+		}
+	}
+
+	__this_cpu_write(prev_clk, cr_clk);
+	preempt_enable();
+
+	return cr_clk;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7572,6 +7642,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
 #ifdef CONFIG_CGROUP_SCHED
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
 #endif
+BTF_ID_FLAGS(func, scx_bpf_clock_get_ns)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
-- 
2.47.0
Re: [PATCH 3/5] sched_ext: Implement scx_bpf_clock_get_ns()
Posted by Peter Zijlstra 5 days, 20 hours ago
On Sun, Nov 17, 2024 at 01:01:24AM +0900, Changwoo Min wrote:
> Returns a high-performance monotonically non-decreasing clock for the
> current CPU. The clock returned is in nanoseconds.
> 
> It provides the following properties:
> 
> 1) High performance: Many BPF schedulers call bpf_ktime_get_ns()
>  frequently to account for execution time and track tasks' runtime
>  properties. Unfortunately, on some hardware platforms, bpf_ktime_get_ns()
>  -- which eventually reads a hardware timestamp counter -- is neither
>  performant nor scalable. scx_bpf_clock_get_ns() aims to provide a
>  high-performance clock by using the rq clock in the scheduler core
>  whenever possible.
> 
> 2) High enough resolution for BPF scheduler use cases: In most BPF
>  scheduler use cases, the required clock resolution is lower than that of
>  the most accurate hardware clock (e.g., rdtsc on x86). scx_bpf_clock_get_ns()
>  uses the rq clock in the scheduler core whenever it is valid. The rq clock
>  is considered valid from the time it is updated (update_rq_clock) until
>  the rq is unlocked (rq_unpin_lock). In addition, the rq clock is
>  invalidated after long-running operations in the BPF scheduler --
>  ops.running() and ops.update_idle().
> 
> 3) Monotonically non-decreasing clock on the same CPU:
>  scx_bpf_clock_get_ns() guarantees the clock never goes backward when
>  comparing values read on the same CPU. On the other hand, there is no
>  such guarantee when comparing values read on different CPUs -- across
>  CPUs the clock can appear to go backward. The clock is monotonically
>  *non-decreasing* rather than strictly increasing, so two
>  scx_bpf_clock_get_ns() calls on the same CPU may return the same value
>  during the same period in which the rq clock is valid.

Have you seen the insides of kernel/sched/clock.c ?
Re: [PATCH 3/5] sched_ext: Implement scx_bpf_clock_get_ns()
Posted by Changwoo Min 5 days ago
Hello,

On 24. 11. 17. 04:31, Peter Zijlstra wrote:
> Have you seen the insides of kernel/sched/clock.c ?

Of course. :-) It would be super helpful if you could let me know
specific questions or comments.

I didn't extend (or piggyback on) clock.c because, as
I explained in the other email, I think it is overkill and creates
dependencies between clock.c and the sched_ext code.

Regards,
Changwoo Min
Re: [PATCH 3/5] sched_ext: Implement scx_bpf_clock_get_ns()
Posted by Peter Zijlstra 4 days, 6 hours ago
On Mon, Nov 18, 2024 at 12:48:35AM +0900, Changwoo Min wrote:
> Hello,
> 
> On 24. 11. 17. 04:31, Peter Zijlstra wrote:
> > Have you seen the insides of kernel/sched/clock.c ?
> 
> Of course. :-) It would be super helpful if you could let me know
> specific questions or comments.

Well, mostly I don't understand anything about what you're doing.
Perhaps if you explain what's wrong with the bits we have ?

If you're looking at the scheduler, then rq->clock really should be
sufficient.

If you're looking at rando BPF crud, then what's wrong with
local_clock()?

Also, why are we still caring about systems that have crazy TSC?
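
For context (an illustrative sketch, not code from this thread): local_clock()
lives in kernel/sched/clock.c and is the in-tree fast per-CPU clock being
referred to here. A kfunc wrapper over it would be roughly the following; the
wrapper name is hypothetical and its BTF_ID_FLAGS registration is omitted.

/*
 * Hypothetical sketch: this would sit in kernel/sched/ext.c next to the
 * other scx_bpf_*() kfuncs, where __bpf_kfunc and local_clock() (from
 * <linux/sched/clock.h>) are already available.
 */
__bpf_kfunc u64 scx_bpf_local_clock_ns(void)
{
	/*
	 * local_clock() returns nanoseconds from the per-CPU sched_clock
	 * machinery; on platforms with an unstable TSC it uses the filtered
	 * sched_clock_cpu() path, which is kept from going backwards on a
	 * given CPU.
	 */
	return local_clock();
}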