Hi!

Latest version of the EEVDF [1] patches.

The only real change since last time is the fix for tick-preemption [2], and a
simple safe-guard for the mixed slice heuristic.

Other than that, I've re-arranged the patches to make EEVDF come first and
have the latency-nice or slice-attribute patches on top.

Results should not be different from last time around; lots of people ran them
and found no major performance issues. What was found was better latency and
smaller variance (probably due to the more stable latency). I'm hoping we can
start queueing this part.

The big question is what additional interface to expose; some people have
voiced objections to the latency-nice interface, and the 'obvious' alternative
is to directly expose the slice length as a request/hint. The very last patch
implements this alternative using sched_attr::sched_runtime but is untested.

Diffstat for the base patches [1-11]:

 include/linux/rbtree_augmented.h |   26 +
 include/linux/sched.h            |    7 +-
 kernel/sched/core.c              |    2 +
 kernel/sched/debug.c             |   48 +-
 kernel/sched/fair.c              | 1105 ++++++++++++++++++--------------------
 kernel/sched/features.h          |   24 +-
 kernel/sched/sched.h             |   16 +-
 7 files changed, 587 insertions(+), 641 deletions(-)

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
[2] https://lkml.kernel.org/r/20230420150537.GC4253%40hirez.programming.kicks-ass.net
Hi Peter,
On Wed, May 31, 2023 at 01:58:39PM +0200, Peter Zijlstra wrote:
>
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> The only real change since last time is the fix for tick-preemption [2], and a
> simple safe-guard for the mixed slice heuristic.
We're seeing regressions from EEVDF with SPEC CPU, a database workload,
and a Java workload. We tried SPEC CPU on five systems, and here are
numbers from one of them (high core count, two-socket x86 machine).
SPECrate2017 oversubscribed by 2x (two copies of the test per CPU)
Base: v6.3-based kernel
EEVDF: Base + patches from May 31 [0]
Performance comparison: >0 if EEVDF wins
Integer
-0.5% 500.perlbench_r
-6.6% 502.gcc_r
-8.7% 505.mcf_r
-9.2% 520.omnetpp_r
-6.6% 523.xalancbmk_r
-0.7% 525.x264_r
-2.1% 531.deepsjeng_r
-0.4% 541.leela_r
-0.3% 548.exchange2_r
-2.6% 557.xz_r
-3.8% Est(*) SPECrate2017_int_base
Floating Point
-0.6% 503.bwaves_r
-1.3% 507.cactuBSSN_r
-0.8% 508.namd_r
-17.8% 510.parest_r
0.3% 511.povray_r
-1.0% 519.lbm_r
-7.7% 521.wrf_r
-2.4% 526.blender_r
-6.1% 527.cam4_r
-2.0% 538.imagick_r
0.1% 544.nab_r
-0.7% 549.fotonik3d_r
-11.3% 554.roms_r
-4.1% Est(*) SPECrate2017_fp_base
(*) SPEC CPU Fair Use rules require that tests with non-production
components must be marked as estimates.
The other machines show similarly consistent regressions, and we've tried a
v6.5-rc4-based kernel with the latest EEVDF patches from tip/sched/core
including the recent fixlet "sched/eevdf: Curb wakeup-preemption". I can post
the rest of the numbers, but I'm trying to keep this on the shorter side for
now.
Running the database workload on a two-socket x86 server, we see
regressions of up to 6% when the number of users exceeds the number of
CPUs.
With the Java workload on another two-socket x86 server, we see a 10%
regression.
We're investigating the other benchmarks, but here's what I've found so far
with SPEC CPU. Some schedstats showed that eevdf is tick-preemption happy
(patches below). These stats were taken over 1 minute near the middle of a ~26
minute benchmark (502.gcc_r).
Base: v6.5-rc4-based kernel
EEVDF: Base + the latest EEVDF patches from tip/sched/core
schedstat                  Base        EEVDF

sched                 1,243,911    3,947,251

tick_check_preempts  12,899,049
tick_preempts         1,022,998

check_deadline                    15,878,463
update_deadline                    3,895,530
preempt_deadline                   3,751,580
In both kernels, tick preemption is primarily what drives schedule()s.
Preemptions happen over three times more often for EEVDF because in the base,
tick preemption happens after a task has run through its ideal timeslice as a
fraction of sched_latency (so two tasks sharing a CPU each get 12ms on a server
with enough CPUs, sched_latency being 24ms), whereas with eevdf, a task's base
slice determines when it gets tick-preempted, and that's 3ms by default. It
seems SPEC CPU isn't liking the increased scheduling of EEVDF in a cpu-bound
load like this. When I set the base_slice_ns sysctl to 12000000, the
regression disappears.
I'm still thinking about how to fix it. Pre-EEVDF, tick preemption was
more flexible in that a task's timeslice could change depending on how
much competition it had on the same CPU, but with EEVDF the timeslice is
fixed no matter what else is running, and growing or shrinking it
depending on nr_running doesn't honor whatever deadline was set for the
task.
The schedstat patch for the base:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..fb5a35aa07ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,6 +4996,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
struct sched_entity *se;
s64 delta;
+ schedstat_inc(rq_of(cfs_rq)->tick_check_preempts);
+
/*
* When many tasks blow up the sched_period; it is possible that
* sched_slice() reports unusually large results (when many tasks are
@@ -5005,6 +5007,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
@@ -5028,8 +5031,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;
- if (delta > ideal_runtime)
+ if (delta > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
+ }
}
static void
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..1bf12e271756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,6 +1123,10 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* tick preempt stats */
+ unsigned int tick_check_preempts;
+ unsigned int tick_preempts;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..7997b8538b72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->tick_check_preempts, rq->tick_preempts);
seq_printf(seq, "\n");
The schedstat patch for eevdf:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffec98724f3..675f4bbac471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,18 +975,21 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ schedstat_inc(rq_of(cfs_rq)->check_deadline);
if ((s64)(se->vruntime - se->deadline) < 0)
return;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
*/
+ schedstat_inc(rq_of(cfs_rq)->update_deadline);
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
/*
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
+ schedstat_inc(rq_of(cfs_rq)->preempt_deadline);
resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93c2dc80143f..c44b59556367 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,11 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* update_deadline() stats */
+ unsigned int check_deadline;
+ unsigned int update_deadline;
+ unsigned int preempt_deadline;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..2a8bd742507d 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->check_deadline, rq->update_deadline,
+ rq->preempt_deadline);
seq_printf(seq, "\n");
[0] https://lore.kernel.org/all/20230531115839.089944915@infradead.org/
On Wed, Aug 23, 2023 at 08:52:26PM -0400, Daniel Jordan wrote:

> We're investigating the other benchmarks, but here's what I've found so far
> with SPEC CPU. Some schedstats showed that eevdf is tick-preemption happy
> (patches below).

<snip the schedstat numbers and analysis quoted from above>

> I'm still thinking about how to fix it.

EEVDF fundamentally supports per task request/slice sizes, which is the
primary motivator for finally finishing these patches.

So the plan is to extend sched_setattr() to allow tasks setting their
own ideal slice length. But we're not quite there yet.

Having just returned from PTO the mailbox is an utter trainwreck, but
I'll try and refresh those few patches this week for consideration.

In the meantime I think you found the right knob to twiddle.
>
> EEVDF fundamentally supports per task request/slice sizes, which is the
> primary motivator for finally finishing these patches.
>
> So the plan is to extend sched_setattr() to allow tasks setting their
> own ideal slice length. But we're not quite there yet.
>
> Having just returned from PTO the mailbox is an utter trainwreck, but
> I'll try and refresh those few patches this week for consideration.
>
> In the meantime I think you found the right knob to twiddle.
Hello Peter,
I am trying to understand a little better the need for the eligibility
checks (entity_eligible). I understand the general concept, but I am
trying to find a scenario where it is necessary. And maybe propose to
have it toggled by a feature flag.
Some of my testing:
All my testing was done on a two-core, 1.1GHz Celeron N400 system,
running a 6.5-rc3 kernel with the EEVDF changes ported.
I have two CPU bound tasks one with a nice of -4 and the other with a
nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
With entity_eligible *enabled* and with entity_eligible *disabled*
(always returns 1):
Top showed consistent results, one at ~70% and the other at ~30%
So it seems the deadline adjustment will naturally achieve fairness.
I also added a few trace_printks to see if there is a case where
entity_eligible would have returned 0 before the deadline forced us to
reschedule. There were a few such cases. The following snippet of
prints shows that an entity became ineligible 2 slices before its
deadline expired. It seems like this will add more context switching
but still achieve a similar result at the end.
bprint: pick_eevdf: eligibility check: tid=4568,
eligible=0, deadline=26577257249, vruntime=26575761118
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26575451399, vruntime=26574838855
sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=1, deadline=26576270304, vruntime=26575657159
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26576270304, vruntime=26575657159
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577089170, vruntime=26576476006
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26577089170, vruntime=26576476006
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577908042, vruntime=26577294838
bprint: pick_eevdf: found best deadline: tid=4568,
deadline=26577257249, vruntime=26575761118
sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
In a more practical example, I tried this with one of our benchmarks
that involves running Meet and Docs side by side and measuring the
input latency in the Docs document. The following is the average
latency for 5 runs:
(These numbers are after removing our cgroup hierarchy - that might be
a discussion for a later time).
CFS: 168ms
EEVDF with eligibility: 206ms (regression from CFS)
EEVDF *without* eligibility: 143ms (improvement to CFS)
EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
104ms (great improvement)
Removing the eligibility check for this workload seemed to result in a
great improvement. I haven't dug deeper but I suspect it's related to
reduced context switches on our 2 core system.
As an extra test I also increased the base_slice_ns and it further
improved the input latency significantly.
I would love to hear your thoughts. Thanks!
On Fri, Sep 29, 2023 at 11:54:25AM -0500, Youssef Esmat wrote:

> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.

My initial response was that it ensures fairness, but thinking a little
about it I'm not so sure anymore.

I do think it features in section 6 lemma 4 and 5, which provide
interference bounds. But I'd have to think a little more about it.

In the current setup, where all requests are of equal size, the virtual
deadline tree is basically identical to the virtual runtime tree, just
transposed (in the steady-state scenario).

When mixing request sizes things become a little more interesting.

Let me ponder this a little bit more.
On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
> When mixing request sizes things become a little more interesting.
>
> Let me ponder this a little bit more.
Using the attached program (I got *REALLY* fed up trying to draw these
diagrams by hand), let us illustrate the difference between Earliest
*Eligible* Virtual Deadline First and the one with the Eligible test
taken out: EVDF.
Specifically, the program was used with the following argument for
EEVDF:
./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
and with an additional '-n' for the EVDF column.
EEVDF EVDF
d = 6 d = 6
d = 2 d = 2
d = 18 d = 18
q = 2 q = 2
t=0 V=1 t=0 V=1
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
t=2 V=1 t=2 V=1
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
t=8 V=3 t=4 V=2
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|--*------|---------|---------|---------|---- |-*-------|---------|---------|---------|----
t=10 V=4 t=10 V=4
A |----< A |----<
B |< >B |<
>C |----------------< C |----------------<
|---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
t=28 V=10 t=12 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------*---------|---------|---------|---- |----*----|---------|---------|---------|----
t=30 V=11 t=14 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |----*----|---------|---------|---------|----
t=32 V=11 t=16 V=6
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |-----*---|---------|---------|---------|----
t=34 V=12 t=22 V=8
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-*-------|---------|---------|---- |-------*-|---------|---------|---------|----
t=40 V=14 t=24 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|---*-----|---------|---------|---- |--------*|---------|---------|---------|----
t=42 V=15 t=26 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |--------*|---------|---------|---------|----
t=44 V=15 t=28 V=10
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |---------*---------|---------|---------|----
t=46 V=16 t=34 V=12
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-----*---|---------|---------|---- |---------|-*-------|---------|---------|----
t=52 V=18 t=36 V=13
A |----< A |----<
>B |< B |<
C |----------------< >C |----------------<
|---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
t=54 V=19 t=54 V=19
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|--------*|---------|---------|---- |---------|--------*|---------|---------|----
lags: -10, 6 lags: -7, 11
BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
As I wrote before; EVDF has worse lag bounds, but this is not
insurmountable. The biggest problem that I can see is that of wakeup
preemption. Currently we allow to preempt when 'current' has reached V
(RUN_TO_PARITY in pick_eevdf()).
With these rules, when EEVDF schedules C (our large slice task) at t=10
above, it is only a little behind C and can be readily preempted after
about 2 time units.
However, EVDF will delay scheduling C until much later, see how A and B
walk far ahead of V until t=36; only then will we pick C. But this means
that we're firmly stuck with C for at least 11 time units. A newly
placed task will be around V and will have no chance to preempt.
That said, I do have me a patch to cure some of that:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
That would allow a task with a shorter request time to preempt in spite
of RUN_TO_PARITY.
However, in this example V is only 2/3 of the way to C's deadline, but
if we were to have many more tasks, you would see V get closer and closer
to C's deadline, and it would become harder and harder to place a task
such that preemption becomes viable.
Adding 4 more tasks:
./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
t=92 V=16
A |----<
B |<
>C |----------------<
D |<
E |<
F |<
G |<
|---------|-----*---|---------|---------|----
And I worry this will create very real latency spikes.
That said, I do see that not having the eligibility check can help. So I'm
not opposed to adding a sched_feat for this, but I would not want to
default to EVDF.
On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
> t=10 V=4 t=10 V=4
> A |----< A |----<
> B |< >B |<
> >C |----------------< C |----------------<
> |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
>
>
> t=52 V=18 t=36 V=13
> A |----< A |----<
> >B |< B |<
> C |----------------< >C |----------------<
> |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
>
>
> BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
>
>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be readily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36; only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
Playing around with it a little:
EEVDF EVDF
slice 30000000 slice 30000000
# Min Latencies: 00014 # Min Latencies: 00048
# Avg Latencies: 00692 # Avg Latencies: 188239
# Max Latencies: 94633 # Max Latencies: 961241
slice 3000000 slice 3000000
# Min Latencies: 00054 # Min Latencies: 00055
# Avg Latencies: 00522 # Avg Latencies: 00673
# Max Latencies: 41475 # Max Latencies: 13297
slice 300000 slice 300000
# Min Latencies: 00018 # Min Latencies: 00024
# Avg Latencies: 00344 # Avg Latencies: 00056
# Max Latencies: 20061 # Max Latencies: 00860
So while it improves the short slices, it completely blows up the large
slices -- utterly slaughters the large slices in fact.
And all the many variants of BIAS_ELIGIBLE that I've tried so far only
manage to murder the high end while simultaneously not actually helping
the low end -- so that's a complete write off.
By far the sanest option so far is PLACE_SLEEPER -- and that is very
much not a nice option either :-(
On Sat, Oct 7, 2023 at 5:04 PM Peter Zijlstra <peterz@infradead.org> wrote:

<snip the diagrams and discussion quoted from above>

> Playing around with it a little:
>
>                 EEVDF                           EVDF
>
> slice 30000000                   slice 30000000
> # Min Latencies: 00014           # Min Latencies: 00048
> # Avg Latencies: 00692           # Avg Latencies: 188239
> # Max Latencies: 94633           # Max Latencies: 961241
>
> slice 3000000                    slice 3000000
> # Min Latencies: 00054           # Min Latencies: 00055
> # Avg Latencies: 00522           # Avg Latencies: 00673
> # Max Latencies: 41475           # Max Latencies: 13297
>
> slice 300000                     slice 300000
> # Min Latencies: 00018           # Min Latencies: 00024
> # Avg Latencies: 00344           # Avg Latencies: 00056
> # Max Latencies: 20061           # Max Latencies: 00860

Thanks for sharing. Which workload was used to generate these numbers?

I think looking at the sched latency numbers alone does not show the
complete picture. I ran the same input latency test again and tried to
capture some of these numbers for the chrome processes.

EEVDF 1.5ms slice:
  Input latency test result: 226ms
  perf sched latency:
    switches: 1,084,694
    avg:   1.139 ms
    max: 408.397 ms

EEVDF 6.0ms slice:
  Input latency test result: 178ms
  perf sched latency:
    switches: 892,306
    avg:   1.145 ms
    max: 354.344 ms

EVDF 6.0ms slice:
  Input latency test result: 112ms
  perf sched latency:
    switches: 134,200
    avg:   2.610 ms
    max: 374.888 ms

EVDF 6.0ms slice (no run_to_parity, no place_lag, no place_deadline_initial):
  Input latency test result: 110ms
  perf sched latency:
    switches: 531,656
    avg:   0.830 ms
    max: 520.463 ms

For our scenario, it is very expensive to interrupt UI threads. It will
increase the input latency significantly. Lowering the scheduling
latency at the cost of switching out important threads can be very
detrimental in this workload. UI and input threads run with a nice value
of -8.

This also seems to match Daniel's message earlier in this thread where
using a 12ms base slice improved their benchmarks.

That said, this might not be beneficial for all workloads, and we are
still trying our other workloads out.

> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(
Sorry, I seem to have forgotten to reply to this part...

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> I think looking at the sched latency numbers alone does not show the
> complete picture. I ran the same input latency test again and tried to
> capture some of these numbers for the chrome processes.

<snip the per-slice input latency and perf sched numbers quoted from above>

> For our scenario, it is very expensive to interrupt UI threads. It
> will increase the input latency significantly. Lowering the scheduling
> latency at the cost of switching out important threads can be very
> detrimental in this workload. UI and input threads run with a nice
> value of -8.
>
> That said, this might not be beneficial for all workloads, and we are
> still trying our other workloads out.

Right, this seems to suggest something on your critical path (you should
trace that) has more than 3ms of compute in a single activation.
Basically this means chrome is fairly fat on this critical path. But it
seems you know about that.

Anyway, once you know the 95% length of the longest activation on your
critical path, you can indeed set your slice to that. This should be
readily available from trace data.
On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

<snip the EEVDF vs EVDF latency numbers quoted from above>

> Thanks for sharing. Which workload was used to generate these numbers?

This is hackbench vs cyclictest, where cyclictest gets a custom slice
set; the big slice is 10 * normal, the middle slice is normal (equal to
hackbench) and the short slice is normal / 10.
On Sun, Oct 08, 2023 at 12:04:00AM +0200, Peter Zijlstra wrote:

<snip the EEVDF vs EVDF diagrams and latency numbers quoted from above>

> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

And this can be easily explained by the fact that we insert tasks around
0-lag, so if we delay execution past this point we create an effective
DoS window.
On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
>
> Specifically, the program was used with the following argument for
> EEVDF:
>
> ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
>
> and with an additional '-n' for the EVDF column.
>
<snip a metric ton of diagrams>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be readily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
>
> That said, I do have me a patch to cure some of that:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
>
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
>
So, doing all that gave me an idea, see queue/sched/eevdf BIAS_ELIGIBLE.
It pushes the eligibility threshold (V) right by one average request.
The below patch is against eevdf.c.
I've not yet run it through the normal set of hackbench, netperf etc., so
it might eat your pet and set your granny on fire.
--- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
+++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
@@ -7,6 +7,7 @@
 #include <sys/param.h>
 
 bool eligible = true;
+bool bias = false;
 unsigned long V_lim = 20;
 
 struct entity {
@@ -79,16 +80,17 @@
 
 struct entity *pick_entity(int nr, struct entity *es)
 {
-	unsigned long W = 0, V = 0;
+	unsigned long W = 0, V = 0, R = 0;
 	struct entity *e = NULL;
 
 	for (int i = 0; i < nr; i++) {
 		V += es[i].weight * es[i].vruntime;
+		R += es[i].request;
 		W += es[i].weight;
 	}
 
 	for (int i = 0; i < nr; i++) {
-		if (eligible && W*es[i].vruntime > V)
+		if (eligible && W*es[i].vruntime > V + (bias * R))
 			continue;
 
 		if (!e || es[i].vdeadline < e->vdeadline)
@@ -169,10 +171,14 @@
 
 	const int N = sizeof(es) / sizeof(es[0]);
 
-	while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
+	while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
 		unsigned int v,w,r;
 
 		switch (opt) {
+		case 'b':
+			bias = true;
+			break;
+
 		case 'n':
 			eligible = false;
 			break;
On Thu, Oct 05, 2023 at 04:14:08PM +0200, Peter Zijlstra wrote:
> --- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
> +++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
> @@ -7,6 +7,7 @@
> #include <sys/param.h>
>
> bool eligible = true;
> +bool bias = false;
> unsigned long V_lim = 20;
>
> struct entity {
> @@ -79,16 +80,17 @@
>
> struct entity *pick_entity(int nr, struct entity *es)
> {
> - unsigned long W = 0, V = 0;
> + unsigned long W = 0, V = 0, R = 0;
> struct entity *e = NULL;
>
> for (int i = 0; i < nr; i++) {
> V += es[i].weight * es[i].vruntime;
> + R += es[i].request;
* 1024
Also, average seems too much, one large value lifts it too easily.
Need to come up with something better :/
> W += es[i].weight;
> }
>
> for (int i = 0; i < nr; i++) {
> - if (eligible && W*es[i].vruntime > V)
> + if (eligible && W*es[i].vruntime > V + (bias * R))
> continue;
>
> if (!e || es[i].vdeadline < e->vdeadline)
> @@ -169,10 +171,14 @@
>
> const int N = sizeof(es) / sizeof(es[0]);
>
> - while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
> + while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
> unsigned int v,w,r;
>
> switch (opt) {
> + case 'b':
> + bias = true;
> + break;
> +
> case 'n':
> eligible = false;
> break;
On Fri, Sep 29, 2023 at 11:54 AM Youssef Esmat
<youssefesmat@chromium.org> wrote:
>
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
>
> Hello Peter,
>
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.
>
> Some of my testing:
>
> All my testing was done on a two-core Celeron N400 CPU system at 1.1GHz,
> running the 6.5-rc3 kernel with the EEVDF changes ported.
>
> I have two CPU-bound tasks, one with a nice of -4 and the other with a
> nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
>
> With entity_eligible *enabled* and with entity_eligible *disabled*
> (always returns 1):
> Top showed consistent results, one at ~70% and the other at ~30%
>
> So it seems the deadline adjustment will naturally achieve fairness.
>
> I also added a few trace_printks to see if there is a case where
> entity_eligible would have returned 0 before the deadline forced us to
> reschedule. There were a few such cases. The following snippet of
> prints shows that an entity became ineligible 2 slices before its
> deadline expired. It seems like this will add more context switching
> but still achieve a similar result at the end.
>
> bprint: pick_eevdf: eligibility check: tid=4568,
> eligible=0, deadline=26577257249, vruntime=26575761118
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26575451399, vruntime=26574838855
> sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
> prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=1, deadline=26576270304, vruntime=26575657159
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26576270304, vruntime=26575657159
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577089170, vruntime=26576476006
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26577089170, vruntime=26576476006
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577908042, vruntime=26577294838
> bprint: pick_eevdf: found best deadline: tid=4568,
> deadline=26577257249, vruntime=26575761118
> sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
> prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
>
> In a more practical example, I tried this with one of our benchmarks
> that involves running Meet and Docs side by side and measuring the
> input latency in the Docs document. The following is the average
> latency for 5 runs:
>
> (These numbers are after removing our cgroup hierarchy - that might be
> a discussion for a later time).
>
> CFS: 168ms
> EEVDF with eligibility: 206ms (regression from CFS)
> EEVDF *without* eligibility: 143ms (improvement to CFS)
> EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
> 104ms (great improvement)
>
> Removing the eligibility check for this workload seemed to result in a
> great improvement. I haven't dug deeper but I suspect it's related to
> reduced context switches on our 2 core system.
> As an extra test I also increased the base_slice_ns and it further
> improved the input latency significantly.
>
> I would love to hear your thoughts. Thanks!
For completeness I ran two more tests:
1. EEVDF with eligibility and 6ms base_slice_ns.
2. EEVDF with eligibility with Benjamin Segall's patch
(https://lore.kernel.org/all/xm261qego72d.fsf_-_@google.com/).
I copied over all the previous results for easier comparison.
CFS: 168ms
EEVDF, eligib, 1.5ms slice: 206ms
EEVDF, eligib, 6ms slice: 167ms
EEVDF_Fix, eligib, 1.5ms slice: 190ms
EEVDF, *no*eligib, 1.5ms slice: 143ms
EEVDF, *no*eligib, 6ms slice: 104ms
It does seem like Benjamin's fix brought an improvement.