[PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
Posted by Changwoo Min 1 year ago
scx_bpf_now_ns() returns a high-performance, monotonically non-decreasing
clock for the current CPU. The returned clock is in nanoseconds.

It provides the following properties:

1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
 to account for execution time and track tasks' runtime properties.
 Unfortunately, on some hardware platforms, bpf_ktime_get_ns() -- which
 eventually reads a hardware timestamp counter -- is neither performant nor
 scalable. scx_bpf_now_ns() aims to provide a high-performance clock by
 using the rq clock in the scheduler core whenever possible.

2) High enough resolution for BPF scheduler use cases: In most BPF scheduler
 use cases, the required clock resolution is coarser than that of the most
 accurate hardware clock (e.g., rdtsc on x86). scx_bpf_now_ns() therefore
 uses the rq clock in the scheduler core whenever it is valid. The rq clock
 is considered valid from the time it is updated (update_rq_clock) until
 the rq is unlocked (rq_unpin_lock).

3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
 guarantees that the clock never goes backward when two readings are
 compared on the same CPU. There is no such guarantee across CPUs -- a
 clock read on one CPU may appear to be behind one read on another. The
 clock is monotonically *non-decreasing* rather than strictly increasing:
 two scx_bpf_now_ns() calls on the same CPU may return the same value
 while the same rq clock period is still valid.
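
For illustration, a BPF scheduler could use this clock for per-task runtime
accounting roughly as follows. The task context layout and the
lookup_task_ctx() helper below are illustrative only and not part of this
series; since ops.running() and ops.stopping() are invoked on the CPU the
task runs on, the same-CPU monotonicity above applies to this pair:

	u64 scx_bpf_now_ns(void) __ksym;

	struct task_ctx {
		u64	running_at;	/* clock when the task went on-CPU */
		u64	runtime;	/* accumulated on-CPU time in ns */
	};

	void BPF_STRUCT_OPS(example_running, struct task_struct *p)
	{
		struct task_ctx *tctx = lookup_task_ctx(p);

		if (tctx)
			tctx->running_at = scx_bpf_now_ns();
	}

	void BPF_STRUCT_OPS(example_stopping, struct task_struct *p, bool runnable)
	{
		struct task_ctx *tctx = lookup_task_ctx(p);

		if (tctx)
			tctx->runtime += scx_bpf_now_ns() - tctx->running_at;
	}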

Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
 kernel/sched/ext.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 71342f3719c1..f0476d5dd6f5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7601,6 +7601,78 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
 }
 #endif
 
+/**
+ * scx_bpf_now_ns - Returns a high-performance monotonically non-decreasing
+ * clock for the current CPU. The clock returned is in nanoseconds.
+ *
+ * It provides the following properties:
+ *
+ * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
+ *  to account for execution time and track tasks' runtime properties.
+ *  Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
+ *  eventually reads a hardware timestamp counter -- is neither performant nor
+ *  scalable. scx_bpf_now_ns() aims to provide a high-performance clock by
+ *  using the rq clock in the scheduler core whenever possible.
+ *
+ * 2) High enough resolution for the BPF scheduler use cases: In most BPF
+ *  scheduler use cases, the required clock resolution is lower than the most
+ *  accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now_ns() basically
+ *  uses the rq clock in the scheduler core whenever it is valid. It considers
+ *  that the rq clock is valid from the time the rq clock is updated
+ *  (update_rq_clock) until the rq is unlocked (rq_unpin_lock).
+ *
+ * 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
+ *  guarantees the clock never goes backward when comparing them in the same
+ *  CPU. On the other hand, when comparing clocks in different CPUs, there
+ *  is no such guarantee -- the clock can go backward. It provides a
+ *  monotonically *non-decreasing* clock so that it would provide the same
+ *  clock values in two different scx_bpf_now_ns() calls in the same CPU
+ *  during the same period of when the rq clock is valid.
+ */
+__bpf_kfunc u64 scx_bpf_now_ns(void)
+{
+	struct rq *rq;
+	u64 clock;
+
+	preempt_disable();
+
+	/*
+	 * If the rq clock is valid, use the cached rq clock
+	 * whenever the clock does not go backward.
+	 */
+	rq = this_rq();
+	clock = rq->scx.clock;
+
+	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
+	    (rq->scx.prev_clock >= clock)) {
+		/*
+		 * If the rq clock is invalid or goes backward,
+		 * start a new rq clock period with a fresh sched_clock_cpu().
+		 *
+		 * The cached rq clock can go backward because there is a
+		 * race with a timer interrupt. Suppose that a timer interrupt
+		 * occurred while running scx_bpf_now_ns() *after* reading the
+		 * rq clock and *before* comparing the if condition. The timer
+		 * interrupt will eventually call a BPF scheduler's ops.tick(),
+		 * and the BPF scheduler can call scx_bpf_now_ns(). Since the
+		 * scheduler core updates the rq clock before calling
+		 * ops.tick(), the scx_bpf_now_ns() call will get the fresh
+		 * clock. After handling the timer interrupt, the interrupted
+		 * scx_bpf_now_ns() will be resumed, so the if condition will
+		 * be compared. In this case, the clock, which was read before
+		 * the timer interrupt, will be the same as rq->scx.prev_clock.
+		 * When such a case is detected, start a new rq clock period
+		 * with a fresh sched_clock_cpu().
+		 */
+		clock = sched_clock_cpu(cpu_of(rq));
+		scx_rq_clock_update(rq, clock);
+	}
+
+	preempt_enable();
+
+	return clock;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -7632,6 +7704,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
 #ifdef CONFIG_CGROUP_SCHED
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
 #endif
+BTF_ID_FLAGS(func, scx_bpf_now_ns)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
-- 
2.47.1
Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
Posted by Peter Zijlstra 1 year ago
On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:

> +__bpf_kfunc u64 scx_bpf_now_ns(void)
> +{
> +	struct rq *rq;
> +	u64 clock;
> +
> +	preempt_disable();
> +
> +	/*
> +	 * If the rq clock is valid, use the cached rq clock
> +	 * whenever the clock does not go backward.
> +	 */
> +	rq = this_rq();
> +	clock = rq->scx.clock;
> +
> +	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
> +	    (rq->scx.prev_clock >= clock)) {

As TJ said, it's best to consider that the clock can wrap.

> +		/*
> +		 * If the rq clock is invalid or goes backward,
> +		 * start a new rq clock period with a fresh sched_clock_cpu().
> +		 *
> +		 * The cached rq clock can go backward because there is a
> +		 * race with a timer interrupt. Suppose that a timer interrupt
> +		 * occurred while running scx_bpf_now_ns() *after* reading the
> +		 * rq clock and *before* comparing the if condition. The timer
> +		 * interrupt will eventually call a BPF scheduler's ops.tick(),
> +		 * and the BPF scheduler can call scx_bpf_now_ns(). Since the
> +		 * scheduler core updates the rq clock before calling
> +		 * ops.tick(), the scx_bpf_now_ns() call will get the fresh
> +		 * clock. After handling the timer interrupt, the interrupted
> +		 * scx_bpf_now_ns() will be resumed, so the if condition will
> +		 * be compared. In this case, the clock, which was read before
> +		 * the timer interrupt, will be the same as rq->scx.prev_clock.
> +		 * When such a case is detected, start a new rq clock period
> +		 * with a fresh sched_clock_cpu().

This has a wall-of-text problem; use paragraphs?

> +		 */
> +		clock = sched_clock_cpu(cpu_of(rq));
> +		scx_rq_clock_update(rq, clock);

Doesn't this set the VALID bit again? How is using this outside of
RQ-lock and setting VALID a good idea?

> +	}
> +
> +	preempt_enable();
> +
> +	return clock;
> +}
> +
>  __bpf_kfunc_end_defs();
>  
>  BTF_KFUNCS_START(scx_kfunc_ids_any)
> @@ -7632,6 +7704,7 @@ BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
>  #ifdef CONFIG_CGROUP_SCHED
>  BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
>  #endif
> +BTF_ID_FLAGS(func, scx_bpf_now_ns)
>  BTF_KFUNCS_END(scx_kfunc_ids_any)
>  
>  static const struct btf_kfunc_id_set scx_kfunc_set_any = {
> -- 
> 2.47.1
>
Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
Posted by Changwoo Min 1 year ago
Hello,

Thank you for the review!

On 24. 12. 11. 18:32, Peter Zijlstra wrote:
> On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
>> +	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
>> +	    (rq->scx.prev_clock >= clock)) {
> 
> As TJ said, it's best to consider that the clock can wrap.
I will update it as Tejun suggested.

> 
>> +		/*
>> +		 * If the rq clock is invalid or goes backward,
>> +		 * start a new rq clock period with a fresh sched_clock_cpu().
>> +		 *
>> +		 * The cached rq clock can go backward because there is a
>> +		 * race with a timer interrupt. Suppose that a timer interrupt
>> +		 * occurred while running scx_bpf_now_ns() *after* reading the
>> +		 * rq clock and *before* comparing the if condition. The timer
>> +		 * interrupt will eventually call a BPF scheduler's ops.tick(),
>> +		 * and the BPF scheduler can call scx_bpf_now_ns(). Since the
>> +		 * scheduler core updates the rq clock before calling
>> +		 * ops.tick(), the scx_bpf_now_ns() call will get the fresh
>> +		 * clock. After handling the timer interrupt, the interrupted
>> +		 * scx_bpf_now_ns() will be resumed, so the if condition will
>> +		 * be compared. In this case, the clock, which was read before
>> +		 * the timer interrupt, will be the same as rq->scx.prev_clock.
>> +		 * When such a case is detected, start a new rq clock period
>> +		 * with a fresh sched_clock_cpu().
> 
> This has a wall-of-text problem; use paragraphs?
I will improve the presentation using multiple paragraphs
and a time chart.

>> +		clock = sched_clock_cpu(cpu_of(rq));
>> +		scx_rq_clock_update(rq, clock);
> Doesn't this set the VALID bit again? How is using this outside of
> RQ-lock and setting VALID a good idea?

You are right. The current implementation sets the VALID bit, so
the clock can be reused until the next update_rq_clock(). Another
approach would be to not set the VALID flag, so a fresh clock is
read every time until the next update_rq_clock(). Considering how
the scx schedulers use the clock, both approaches should end up
making almost the same number of sched_clock_cpu() calls. But the
second approach -- not setting the VALID flag outside of the rq
lock -- would be more predictable. I will double-check the
difference in sched_clock_cpu() calls, and if they are similar, I
will change the code to not set the VALID flag.
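
A minimal sketch of that second approach, keeping the rest of
scx_bpf_now_ns() as is and using a wrap-safe comparison as suggested
elsewhere in this thread (not a final implementation):

	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
	    !time_after64(clock, rq->scx.prev_clock)) {
		/*
		 * Outside a valid rq clock period, or when the cached clock
		 * is stale: take a fresh reading but do not re-arm
		 * SCX_RQ_CLK_VALID. Only update_rq_clock() under the rq lock
		 * starts a new valid period.
		 */
		clock = sched_clock_cpu(cpu_of(rq));
	}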

Regards,
Changwoo Min
Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
Posted by Tejun Heo 1 year ago
Hello,

I'd roll the preceding two patches into this one.

On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
...
> 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
>  guarantees the clock never goes backward when comparing them in the same
>  CPU. On the other hand, when comparing clocks in different CPUs, there
>  is no such guarantee -- the clock can go backward. It provides a
>  monotonically *non-decreasing* clock so that it would provide the same
>  clock values in two different scx_bpf_now_ns() calls in the same CPU
>  during the same period of when the rq clock is valid.

We probably should provide helpers to calculate deltas between timestamps
and use them consitently in SCX scheds. e.g. ops.runnable() and
ops.running() can run on different CPUs and it'd be useful and common to
calculate the delta between the two points in time.

...
> +__bpf_kfunc u64 scx_bpf_now_ns(void)
> +{
> +	struct rq *rq;
> +	u64 clock;
> +
> +	preempt_disable();
> +
> +	/*
> +	 * If the rq clock is valid, use the cached rq clock
> +	 * whenever the clock does not go backward.
> +	 */
> +	rq = this_rq();
> +	clock = rq->scx.clock;
> +
> +	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
> +	    (rq->scx.prev_clock >= clock)) {

The clocks usually start at zero but it'd still be a good idea to use
time_after64() and friends when comparing the ordering between timestamps.

> +		/*
> +		 * If the rq clock is invalid or goes backward,
> +		 * start a new rq clock period with a fresh sched_clock_cpu().
> +		 *
> +		 * The cached rq clock can go backward because there is a
> +		 * race with a timer interrupt. Suppose that a timer interrupt

This is not limited to timer interrupts, right? This kfunc can be called
from anywhere including tracepoints for code running in IRQ.

> +		 * occurred while running scx_bpf_now_ns() *after* reading the
> +		 * rq clock and *before* comparing the if condition. The timer
> +		 * interrupt will eventually call a BPF scheduler's ops.tick(),
> +		 * and the BPF scheduler can call scx_bpf_now_ns(). Since the
> +		 * scheduler core updates the rq clock before calling
> +		 * ops.tick(), the scx_bpf_now_ns() call will get the fresh
> +		 * clock. After handling the timer interrupt, the interrupted

This might be easier to explain with a two-column table explaining what each
party is doing in what order.

> +		 * scx_bpf_now_ns() will be resumed, so the if condition will
> +		 * be compared. In this case, the clock, which was read before
> +		 * the timer interrupt, will be the same as rq->scx.prev_clock.
> +		 * When such a case is detected, start a new rq clock period
> +		 * with a fresh sched_clock_cpu().
> +		 */
> +		clock = sched_clock_cpu(cpu_of(rq));
> +		scx_rq_clock_update(rq, clock);

Hmmm... what happens if e.g. a timer ends up performing multiple operations
each going through rq pin/unpin?

Thanks.

-- 
tejun
Re: [PATCH v4 4/6] sched_ext: Implement scx_bpf_now_ns()
Posted by Changwoo Min 1 year ago
Hello,

On 24. 12. 11. 17:14, Tejun Heo wrote:
> Hello,
> 
> I'd roll the preceding two patches into this one.
Sure. I will merge patches 2, 3, 4 into one.

> On Mon, Dec 09, 2024 at 03:15:29PM +0900, Changwoo Min wrote:
> ...
>> 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
>>   guarantees the clock never goes backward when comparing them in the same
>>   CPU. On the other hand, when comparing clocks in different CPUs, there
>>   is no such guarantee -- the clock can go backward. It provides a
>>   monotonically *non-decreasing* clock so that it would provide the same
>>   clock values in two different scx_bpf_now_ns() calls in the same CPU
>>   during the same period of when the rq clock is valid.
> 
> We probably should provide helpers to calculate deltas between timestamps
and use them consistently in SCX scheds. e.g. ops.runnable() and
> ops.running() can run on different CPUs and it'd be useful and common to
> calculate the delta between the two points in time.

If I understand correctly, it should be something similar to
jiffies_delta_to_msecs(). Regarding the API name, what about
scx_time_delta(s64 time_delta) and/or scx_time_diff(u64 time_a,
u64 time_b)?
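
As a sketch, such a helper might look like the following (the name and
exact semantics are of course up for discussion):

	static inline u64 scx_time_diff(u64 after, u64 before)
	{
		/*
		 * Clamp to zero when 'after' is not actually later; the two
		 * timestamps may have been taken on different CPUs, where
		 * ordering is not guaranteed.
		 */
		return time_after64(after, before) ? after - before : 0;
	}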

>> +	if (!(rq->scx.flags & SCX_RQ_CLK_VALID) ||
>> +	    (rq->scx.prev_clock >= clock)) {
> 
> The clocks usually start at zero but it'd still be a good idea to use
> time_after64() and friends when comparing the ordering between timestamps.

Sure. I will update the code as suggested.

> 
>> +		/*
>> +		 * If the rq clock is invalid or goes backward,
>> +		 * start a new rq clock period with a fresh sched_clock_cpu().
>> +		 *
>> +		 * The cached rq clock can go backward because there is a
>> +		 * race with a timer interrupt. Suppose that a timer interrupt
> 
> This is not limited to timer interrupts, right? This kfunc can be called
> from anywhere including tracepoints for code running in IRQ.
Yup, you are right. I will update the comments.


> 
>> +		 * occurred while running scx_bpf_now_ns() *after* reading the
>> +		 * rq clock and *before* comparing the if condition. The timer
>> +		 * interrupt will eventually call a BPF scheduler's ops.tick(),
>> +		 * and the BPF scheduler can call scx_bpf_now_ns(). Since the
>> +		 * scheduler core updates the rq clock before calling
>> +		 * ops.tick(), the scx_bpf_now_ns() call will get the fresh
>> +		 * clock. After handling the timer interrupt, the interrupted
> 
> This might be easier to explain with two column table explaning what each
> party is doing in what order.
I will beautify the text for readability.

> 
>> +		 * scx_bpf_now_ns() will be resumed, so the if condition will
>> +		 * be compared. In this case, the clock, which was read before
>> +		 * the timer interrupt, will be the same as rq->scx.prev_clock.
>> +		 * When such a case is detected, start a new rq clock period
>> +		 * with a fresh sched_clock_cpu().
>> +		 */
>> +		clock = sched_clock_cpu(cpu_of(rq));
>> +		scx_rq_clock_update(rq, clock);
> 
> Hmmm... what happens if e.g. a timer ends up performing multiple operations
> each going through rq pin/unpin?

That should be okay. After multiple rq pin/unpin operations, the
resumed scx_bpf_now_ns() will find that prev_clock is greater
than (rather than equal to) the clock it read before being
interrupted, so it will still take a fresh sched_clock_cpu().
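
Roughly, assuming scx_rq_clock_update() (from the earlier patch in
this series) shifts the old clock into prev_clock on each update,
the sequence looks like this (illustrative only):

	interrupted scx_bpf_now_ns()        interrupt on the same CPU
	----------------------------        ------------------------------
	clock = rq->scx.clock (= T1)
	                                    update_rq_clock():
	                                      prev_clock = T1, clock = T2
	                                    ... unpin, re-pin, ...
	                                    update_rq_clock():
	                                      prev_clock = T2, clock = T3
	resumes the comparison:
	  prev_clock (>= T2) >= T1,
	  so a fresh sched_clock_cpu()
	  is taken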

Thanks!
Changwoo Min