[PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Changwoo Min 1 year, 4 months ago
Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
and scx_flash) frequently call bpf_ktime_get_ns() to track tasks' runtime
properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
timestamp counter (TSC). However, reading a hardware TSC is not
performant on some hardware platforms, degrading IPC.

This patchset addresses the performance problem of reading a hardware TSC
by leveraging the rq clock in the scheduler core, introducing a
scx_bpf_now() function for BPF schedulers. Whenever the rq clock
is fresh and valid, scx_bpf_now() returns the rq clock, which has
already been updated by the scheduler core (update_rq_clock), so it can
avoid reading the hardware TSC.

When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
so a subsequent scx_bpf_now() call returns a fresh sched_clock value to the
caller.

In addition, scx_bpf_now() guarantees that the clock is monotonically
non-decreasing on each CPU, so the clock can never go backward on the
same CPU.
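
In outline, the mechanism works as in the sketch below. This is a
simplified illustration of the behavior described above, not the exact
patch: the rq->scx.clock/flags fields and the SCX_RQ_CLK_VALID flag follow
the naming used in the changelog, and the real code may differ in detail.

    __bpf_kfunc u64 scx_bpf_now(void)
    {
            struct rq *rq;
            u64 clock;

            preempt_disable();

            rq = this_rq();
            if (READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_VALID) {
                    /*
                     * The rq clock is fresh and valid: reuse the value
                     * already computed by update_rq_clock() instead of
                     * reading the hardware TSC again.
                     */
                    clock = READ_ONCE(rq->scx.clock);
            } else {
                    /*
                     * The rq clock was invalidated (rq_unpin_lock), so
                     * fall back to a fresh per-CPU sched_clock read.
                     */
                    clock = sched_clock_cpu(cpu_of(rq));
            }

            preempt_enable();

            return clock;
    }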

Using scx_bpf_now() reduces the number of hardware TSC reads by 50-80%
(76% for scx_lavd, 82% for scx_bpfland, and 51% for scx_rusty) in the
following benchmark:

    perf bench -f simple sched messaging -t -g 20 -l 6000
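
On the BPF side, scx_bpf_now() is mostly a drop-in replacement for
bpf_ktime_get_ns() in runtime-accounting paths, which run under the rq
lock and therefore usually see a valid rq clock. A minimal usage sketch,
assuming the usual scx common headers; the task_ctx struct, its
running_at/runtime fields, and the lookup_task_ctx() helper are
hypothetical, for illustration only:

    /* Record when the task starts running on a CPU. */
    void BPF_STRUCT_OPS(myscx_running, struct task_struct *p)
    {
            struct task_ctx *taskc = lookup_task_ctx(p); /* hypothetical */

            if (taskc)
                    taskc->running_at = scx_bpf_now();
    }

    /* Accumulate the task's runtime when it stops running. */
    void BPF_STRUCT_OPS(myscx_stopping, struct task_struct *p, bool runnable)
    {
            struct task_ctx *taskc = lookup_task_ctx(p); /* hypothetical */

            if (taskc)
                    taskc->runtime += scx_bpf_now() - taskc->running_at;
    }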

ChangeLog v6 -> v7:
  - Do not update the rq clock when it is invalidated.
  - Rename scx_bpf_now_ns() to scx_bpf_now().
  - Rename vtime_*() helper functions to time_*().
  - Fix a typo.

ChangeLog v5 -> v6:
  - Drop prev_clock because a race between a process context and an interrupt
    context (e.g., a timer interrupt) is not observable by a caller (discussed
    with Tejun offline), so checking prev_clock is unnecessary.

ChangeLog v4 -> v5:
  - Merge patches 2, 3, and 4 into one for readability.
  - Use a time helper (time_after_eq64) for time comparisons in scx_bpf_now_ns().
  - Do not validate the rq clock outside the rq critical section for
    more predictable behavior.
  - Improve the comment at scx_bpf_now_ns() for readability.
  - Rename scx_rq_clock_stale() to scx_rq_clock_invalidate().
  - Invalidate all the rq clocks upon unloading to prevent getting outdated
    rq clocks from a previous scx scheduler.
  - Use READ/WRITE_ONCE() when accessing rq->scx.{clock, prev_clock, flags}
    to properly handle concurrent accesses from an interrupt context.
  - Add time helpers for BPF schedulers.
  - Update the rdtsc reduction numbers in the cover letter with the latest
    scx_scheds (1.0.7) and the updated scx_bpf_now_ns().

ChangeLog v3 -> v4:
  - Split the code relocation related to scx_enabled() into a separate
    patch.
  - Remove scx_rq_clock_stale() after (or before) ops.running() and
    ops.update_idle() calls.
  - Rename scx_bpf_clock_get_ns() to scx_bpf_now_ns() and revise it to
    address the review comments.
  - Move the per-CPU variable holding a prev clock into scx_rq
    (rq->scx.prev_clock).
  - Add a comment describing when the clock could go backward in
    scx_bpf_now_ns().
  - Rebase the code to the tip of Tejun's sched_ext repo (for-next
    branch).

ChangeLog v2 -> v3:
  - To avoid unnecessarily modifying cache lines, scx_rq_clock_update()
    and scx_rq_clock_stale() update the clock and flags only when a
    sched_ext scheduler is enabled.

ChangeLog v1 -> v2:
  - Rename SCX_RQ_CLK_UPDATED to SCX_RQ_CLK_VALID to denote the validity
    of an rq clock clearly.
  - Rearrange the clock and flags fields in struct scx_rq so that they
    share the same cacheline, minimizing cache misses.
  - Add an additional explanation to the commit message in the 2/5 patch
    describing when the rq clock will be reused with an example.
  - Fix typos
  - Rebase the code to the tip of Tejun's sched_ext repo

Changwoo Min (6):
  sched_ext: Relocate scx_enabled() related code
  sched_ext: Implement scx_bpf_now()
  sched_ext: Add scx_bpf_now() for BPF scheduler
  sched_ext: Add time helpers for BPF schedulers
  sched_ext: Replace bpf_ktime_get_ns() with scx_bpf_now()
  sched_ext: Use time helpers in BPF schedulers
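
The time helpers added for BPF schedulers (patch 4 above) follow the
kernel's classic time_after() idiom, so comparisons stay correct across
counter wraparound. A minimal sketch of the idea; simplified from what
common.bpf.h gains, which may carry more variants:

    /*
     * Positive delta between two u64 timestamps, clamped to zero when
     * 'after' is actually behind 'before'.
     */
    static inline s64 time_delta(u64 after, u64 before)
    {
            return (s64)(after - before) > 0 ? (s64)(after - before) : 0;
    }

    /* True if timestamp 'a' is after 'b', wraparound-safe. */
    static inline bool time_after(u64 a, u64 b)
    {
            return (s64)(b - a) < 0;
    }

    /* True if timestamp 'a' is before 'b', wraparound-safe. */
    static inline bool time_before(u64 a, u64 b)
    {
            return time_after(b, a);
    }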

 kernel/sched/core.c                      |  6 +-
 kernel/sched/ext.c                       | 74 +++++++++++++++++-
 kernel/sched/sched.h                     | 51 +++++++++----
 tools/sched_ext/include/scx/common.bpf.h | 95 ++++++++++++++++++++++++
 tools/sched_ext/include/scx/compat.bpf.h |  5 ++
 tools/sched_ext/scx_central.bpf.c        | 11 +--
 tools/sched_ext/scx_flatcg.bpf.c         | 23 +++---
 tools/sched_ext/scx_simple.bpf.c         |  9 +--
 8 files changed, 228 insertions(+), 46 deletions(-)

-- 
2.47.1
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Tejun Heo 1 year, 4 months ago
On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> [...]

The patch series generally looks good to me. Peter, if things look okay to
you, I'll apply the series to sched_ext/for-6.14.

Thanks.

-- 
tejun
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Tejun Heo 1 year, 4 months ago
Hello,

On Fri, Jan 03, 2025 at 12:16:41PM -1000, Tejun Heo wrote:
> On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> > [...]
> 
> The patch series generally looks good to me. Peter, if things look okay to
> you, I'll apply the series to sched_ext/for-6.14.

Applying to sched_ext/for-6.14. Please holler if there are concerns.

Thanks.

-- 
tejun
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Peter Zijlstra 1 year, 4 months ago
On Tue, Jan 07, 2025 at 09:46:48AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 03, 2025 at 12:16:41PM -1000, Tejun Heo wrote:
> > On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> > > [...]
> > 
> > The patch series generally looks good to me. Peter, if things look okay to
> > you, I'll apply the series to sched_ext/for-6.14.
> 
> Applying to sched_ext/for-6.14. Please holler if there are concerns.

Urgh, I'll try and have a look, but have a hate for everybody who's been
working through x-mas :/
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Tejun Heo 1 year, 4 months ago
Hello,

On Tue, Jan 07, 2025 at 08:53:47PM +0100, Peter Zijlstra wrote:
> > > The patch series generally looks good to me. Peter, if things look okay to
> > > you, I'll apply the series to sched_ext/for-6.14.
> > 
> > Applying to sched_ext/for-6.14. Please holler if there are concerns.
> 
> Urgh, I'll try and have a look, but have a hate for everybody who's been
> working through x-mas :/

Ah, I'm just back too. FWIW, the patchset is a lot simpler now, so hopefully
there shouldn't be too much to object to. I'll hold off until you can take a
look.

Thanks.

-- 
tejun
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Peter Zijlstra 1 year, 4 months ago
On Tue, Jan 07, 2025 at 09:55:16AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 07, 2025 at 08:53:47PM +0100, Peter Zijlstra wrote:
> > > > The patch series generally looks good to me. Peter, if things look okay to
> > > > you, I'll apply the series to sched_ext/for-6.14.
> > > 
> > > Applying to sched_ext/for-6.14. Please holler if there are concerns.
> > 
> > Urgh, I'll try and have a look, but have a hate for everybody who's been
> > working through x-mas :/
> 
> Ah, I'm just back too. FWIW, the patchset is a lot simpler now, so hopefully
> there shouldn't be too much to object to. I'll hold off until you can take a
> look.

Thanks! The code does appear simpler now, but I think there's an ordering
issue; I've replied to the patch in question.
Re: [PATCH v7 0/6] sched_ext: Support high-performance monotonically non-decreasing clock
Posted by Andrea Righi 1 year, 4 months ago
Hi Changwoo,

On Mon, Dec 30, 2024 at 06:56:19PM +0900, Changwoo Min wrote:
> [...]

Looks good to me. I also tested it and have nothing to report, therefore:

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea