Hi!

Latest version of the EEVDF [1] patches.

The only real change since last time is the fix for tick-preemption [2], and a
simple safe-guard for the mixed slice heuristic.

Other than that, I've re-arranged the patches to make EEVDF come first and
have the latency-nice or slice-attribute patches on top.

Results should not be different from last time around; lots of people ran them
and found no major performance issues. What was found was better latency and
smaller variance (probably due to the more stable latency). I'm hoping we can
start queueing this part.

The big question is what additional interface to expose; some people have
voiced objections to the latency-nice interface, and the 'obvious' alternative
is to directly expose the slice length as a request/hint. The very last patch
implements this alternative using sched_attr::sched_runtime but is untested.

Diffstat for the base patches [1-11]:

 include/linux/rbtree_augmented.h |   26 +
 include/linux/sched.h            |    7 +-
 kernel/sched/core.c              |    2 +
 kernel/sched/debug.c             |   48 +-
 kernel/sched/fair.c              | 1105 ++++++++++++++++++--------------------
 kernel/sched/features.h          |   24 +-
 kernel/sched/sched.h             |   16 +-
 7 files changed, 587 insertions(+), 641 deletions(-)

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
[2] https://lkml.kernel.org/r/20230420150537.GC4253%40hirez.programming.kicks-ass.net
Hi Peter,
On Wed, May 31, 2023 at 01:58:39PM +0200, Peter Zijlstra wrote:
>
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> The only real change since last time is the fix for tick-preemption [2], and a
> simple safe-guard for the mixed slice heuristic.
We're seeing regressions from EEVDF with SPEC CPU, a database workload,
and a Java workload. We tried SPEC CPU on five systems, and here are
numbers from one of them (high core count, two-socket x86 machine).
SPECrate2017 oversubscribed by 2x (two copies of the test per CPU)
Base: v6.3-based kernel
EEVDF: Base + patches from May 31 [0]
Performance comparison: >0 if EEVDF wins
Integer
-0.5% 500.perlbench_r
-6.6% 502.gcc_r
-8.7% 505.mcf_r
-9.2% 520.omnetpp_r
-6.6% 523.xalancbmk_r
-0.7% 525.x264_r
-2.1% 531.deepsjeng_r
-0.4% 541.leela_r
-0.3% 548.exchange2_r
-2.6% 557.xz_r
-3.8% Est(*) SPECrate2017_int_base
Floating Point
-0.6% 503.bwaves_r
-1.3% 507.cactuBSSN_r
-0.8% 508.namd_r
-17.8% 510.parest_r
0.3% 511.povray_r
-1.0% 519.lbm_r
-7.7% 521.wrf_r
-2.4% 526.blender_r
-6.1% 527.cam4_r
-2.0% 538.imagick_r
0.1% 544.nab_r
-0.7% 549.fotonik3d_r
-11.3% 554.roms_r
-4.1% Est(*) SPECrate2017_fp_base
(*) SPEC CPU Fair Use rules require that tests with non-production
components must be marked as estimates.
The other machines show similarly consistent regressions, and we've tried a
v6.5-rc4-based kernel with the latest EEVDF patches from tip/sched/core
including the recent fixlet "sched/eevdf: Curb wakeup-preemption". I can post
the rest of the numbers, but I'm trying to keep this on the shorter side for
now.
Running the database workload on a two-socket x86 server, we see
regressions of up to 6% when the number of users exceeds the number of
CPUs.
With the Java workload on another two-socket x86 server, we see a 10%
regression.
We're investigating the other benchmarks, but here's what I've found so far
with SPEC CPU. Some schedstats showed that eevdf is tick-preemption happy
(patches below). These stats were taken over 1 minute near the middle of a ~26
minute benchmark (502.gcc_r).
Base: v6.5-rc4-based kernel
EEVDF: Base + the latest EEVDF patches from tip/sched/core
schedstat                  Base        EEVDF

sched                 1,243,911    3,947,251

tick_check_preempts  12,899,049
tick_preempts         1,022,998

check_deadline                    15,878,463
update_deadline                    3,895,530
preempt_deadline                   3,751,580
In both kernels, tick preemption is primarily what drives schedule()s.
Preemptions happen over three times more often for EEVDF because in the base,
tick preemption happens after a task has run through its ideal timeslice as a
fraction of sched_latency (so two tasks sharing a CPU each get 12ms on a server
with enough CPUs, sched_latency being 24ms), whereas with eevdf, a task's base
slice determines when it gets tick-preempted, and that's 3ms by default. It
seems SPEC CPU isn't liking the increased scheduling of EEVDF in a cpu-bound
load like this. When I set the base_slice_ns sysctl to 12000000, the
regression disappears.
I'm still thinking about how to fix it. Pre-EEVDF, tick preemption was
more flexible in that a task's timeslice could change depending on how
much competition it had on the same CPU, but with EEVDF the timeslice is
fixed no matter what else is running, and growing or shrinking it
depending on nr_running doesn't honor whatever deadline was set for the
task.
The schedstat patch for the base:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..fb5a35aa07ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,6 +4996,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
struct sched_entity *se;
s64 delta;
+ schedstat_inc(rq_of(cfs_rq)->tick_check_preempts);
+
/*
* When many tasks blow up the sched_period; it is possible that
* sched_slice() reports unusually large results (when many tasks are
@@ -5005,6 +5007,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
@@ -5028,8 +5031,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;
- if (delta > ideal_runtime)
+ if (delta > ideal_runtime) {
+ schedstat_inc(rq_of(cfs_rq)->tick_preempts);
resched_curr(rq_of(cfs_rq));
+ }
}
static void
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..1bf12e271756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,6 +1123,10 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* tick preempt stats */
+ unsigned int tick_check_preempts;
+ unsigned int tick_preempts;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..7997b8538b72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->tick_check_preempts, rq->tick_preempts);
seq_printf(seq, "\n");
The schedstat patch for eevdf:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffec98724f3..675f4bbac471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,18 +975,21 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ schedstat_inc(rq_of(cfs_rq)->check_deadline);
if ((s64)(se->vruntime - se->deadline) < 0)
return;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
*/
+ schedstat_inc(rq_of(cfs_rq)->update_deadline);
se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
/*
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
+ schedstat_inc(rq_of(cfs_rq)->preempt_deadline);
resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93c2dc80143f..c44b59556367 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,11 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* update_deadline() stats */
+ unsigned int check_deadline;
+ unsigned int update_deadline;
+ unsigned int preempt_deadline;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..2a8bd742507d 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->check_deadline, rq->update_deadline,
+ rq->preempt_deadline);
seq_printf(seq, "\n");
[0] https://lore.kernel.org/all/20230531115839.089944915@infradead.org/
On Wed, Aug 23, 2023 at 08:52:26PM -0400, Daniel Jordan wrote:

> We're investigating the other benchmarks, but here's what I've found so far
> with SPEC CPU. Some schedstats showed that eevdf is tick-preemption happy
> (patches below).

<snip the schedstat numbers and analysis quoted from above>

> I'm still thinking about how to fix it.

EEVDF fundamentally supports per task request/slice sizes, which is the
primary motivator for finally finishing these patches.

So the plan is to extend sched_setattr() to allow tasks setting their
own ideal slice length. But we're not quite there yet.

Having just returned from PTO the mailbox is an utter trainwreck, but
I'll try and refresh those few patches this week for consideration.

In the meantime I think you found the right knob to twiddle.
>
> EEVDF fundamentally supports per task request/slice sizes, which is the
> primary motivator for finally finishing these patches.
>
> So the plan is to extend sched_setattr() to allow tasks setting their
> own ideal slice length. But we're not quite there yet.
>
> Having just returned from PTO the mailbox is an utter trainwreck, but
> I'll try and refresh those few patches this week for consideration.
>
> In the meantime I think you found the right knob to twiddle.
Hello Peter,
I am trying to understand a little better the need for the eligibility
checks (entity_eligible). I understand the general concept, but I am
trying to find a scenario where it is necessary. And maybe propose to
have it toggled by a feature flag.
Some of my testing:
All my testing was done on a two-core, 1.1GHz Celeron N400 system,
running a 6.5-rc3 kernel with the EEVDF changes ported.
I have two CPU bound tasks one with a nice of -4 and the other with a
nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
With entity_eligible *enabled* and with entity_eligible *disabled*
(always returns 1):
Top showed consistent results, one at ~70% and the other at ~30%
So it seems the deadline adjustment will naturally achieve fairness.
I also added a few trace_printks to see if there is a case where
entity_eligible would have returned 0 before the deadline forced us to
reschedule. There were a few such cases. The following snippet of
prints shows that an entity became ineligible 2 slices before its
deadline expired. It seems like this will add more context switching
but still achieve a similar result at the end.
bprint: pick_eevdf: eligibility check: tid=4568,
eligible=0, deadline=26577257249, vruntime=26575761118
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26575451399, vruntime=26574838855
sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=1, deadline=26576270304, vruntime=26575657159
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26576270304, vruntime=26575657159
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577089170, vruntime=26576476006
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26577089170, vruntime=26576476006
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577908042, vruntime=26577294838
bprint: pick_eevdf: found best deadline: tid=4568,
deadline=26577257249, vruntime=26575761118
sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
In a more practical example, I tried this with one of our benchmarks
that involves running Meet and Docs side by side and measuring the
input latency in the Docs document. The following is the average
latency for 5 runs:
(These numbers are after removing our cgroup hierarchy - that might be
a discussion for a later time).
CFS: 168ms
EEVDF with eligibility: 206ms (regression from CFS)
EEVDF *without* eligibility: 143ms (improvement to CFS)
EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
104ms (great improvement)
Removing the eligibility check for this workload seemed to result in a
great improvement. I haven't dug deeper but I suspect it's related to
reduced context switches on our 2 core system.
As an extra test I also increased the base_slice_ns and it further
improved the input latency significantly.
I would love to hear your thoughts. Thanks!
On Fri, Sep 29, 2023 at 11:54:25AM -0500, Youssef Esmat wrote:

> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.

My initial response was that it ensures fairness, but thinking a little
about it I'm not so sure anymore.

I do think it features in section 6 lemma 4 and 5, which provide
interference bounds. But I'd have to think a little more about it.

In the current setup, where all requests are of equal size, the virtual
deadline tree is basically identical to the virtual runtime tree, just
transposed (in the steady-state scenario).

When mixing request sizes things become a little more interesting.

Let me ponder this a little bit more.
On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
> When mixing request sizes things become a little more interesting.
>
> Let me ponder this a little bit more.
Using the attached program (I got *REALLY* fed up trying to draw these
diagrams by hand), let us illustrate the difference between Earliest
*Eligible* Virtual Deadline First and the one with the Eligible test
taken out: EVDF.
Specifically, the program was used with the following argument for
EEVDF:
./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
and with an additional '-n' for the EVDF column.
EEVDF EVDF
d = 6 d = 6
d = 2 d = 2
d = 18 d = 18
q = 2 q = 2
t=0 V=1 t=0 V=1
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
t=2 V=1 t=2 V=1
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
t=8 V=3 t=4 V=2
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|--*------|---------|---------|---------|---- |-*-------|---------|---------|---------|----
t=10 V=4 t=10 V=4
A |----< A |----<
B |< >B |<
>C |----------------< C |----------------<
|---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
t=28 V=10 t=12 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------*---------|---------|---------|---- |----*----|---------|---------|---------|----
t=30 V=11 t=14 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |----*----|---------|---------|---------|----
t=32 V=11 t=16 V=6
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |-----*---|---------|---------|---------|----
t=34 V=12 t=22 V=8
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-*-------|---------|---------|---- |-------*-|---------|---------|---------|----
t=40 V=14 t=24 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|---*-----|---------|---------|---- |--------*|---------|---------|---------|----
t=42 V=15 t=26 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |--------*|---------|---------|---------|----
t=44 V=15 t=28 V=10
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |---------*---------|---------|---------|----
t=46 V=16 t=34 V=12
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-----*---|---------|---------|---- |---------|-*-------|---------|---------|----
t=52 V=18 t=36 V=13
A |----< A |----<
>B |< B |<
C |----------------< >C |----------------<
|---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
t=54 V=19 t=54 V=19
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|--------*|---------|---------|---- |---------|--------*|---------|---------|----
lags: -10, 6 lags: -7, 11
BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
As I wrote before; EVDF has worse lag bounds, but this is not
insurmountable. The biggest problem that I can see is that of wakeup
preemption. Currently we allow to preempt when 'current' has reached V
(RUN_TO_PARITY in pick_eevdf()).
With these rules, when EEVDF schedules C (our large slice task) at t=10
above, it is only a little behind C and can be readily preempted after
about 2 time units.
However, EVDF will delay scheduling C until much later, see how A and B
walk far ahead of V until t=36; only then will we pick C. But this means
that we're firmly stuck with C for at least 11 time units. A newly
placed task will be around V and will have no chance to preempt.
That said, I do have me a patch to cure some of that:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
That would allow a task with a shorter request time to preempt in spite
of RUN_TO_PARITY.
However, in this example V is only 2/3 of the way to C's deadline, but
if we were to have many more tasks, you would see V get closer and closer
to C's deadline, and it would become harder and harder to place a task
such that preemption becomes viable.
Adding 4 more tasks:
./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
t=92 V=16
A |----<
B |<
>C |----------------<
D |<
E |<
F |<
G |<
|---------|-----*---|---------|---------|----
And I worry this will create very real latency spikes.
That said, I do see that not having the eligibility check can help. So I'm
not opposed to adding a sched_feat for this, but I would not want to
default to EVDF.
On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
> t=10 V=4 t=10 V=4
> A |----< A |----<
> B |< >B |<
> >C |----------------< C |----------------<
> |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
>
>
> t=52 V=18 t=36 V=13
> A |----< A |----<
> >B |< B |<
> C |----------------< >C |----------------<
> |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
>
>
> BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
>
>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be readily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36; only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
Playing around with it a little:
EEVDF EVDF
slice 30000000 slice 30000000
# Min Latencies: 00014 # Min Latencies: 00048
# Avg Latencies: 00692 # Avg Latencies: 188239
# Max Latencies: 94633 # Max Latencies: 961241
slice 3000000 slice 3000000
# Min Latencies: 00054 # Min Latencies: 00055
# Avg Latencies: 00522 # Avg Latencies: 00673
# Max Latencies: 41475 # Max Latencies: 13297
slice 300000 slice 300000
# Min Latencies: 00018 # Min Latencies: 00024
# Avg Latencies: 00344 # Avg Latencies: 00056
# Max Latencies: 20061 # Max Latencies: 00860
So while it improves the short slices, it completely blows up the large
slices -- utterly slaughters the large slices in fact.
And all the many variants of BIAS_ELIGIBLE that I've tried so far only
manage to murder the high end while simultaneously not actually helping
the low end -- so that's a complete write off.
By far the sanest option so far is PLACE_SLEEPER -- and that is very
much not a nice option either :-(
On Sat, Oct 7, 2023 at 5:04 PM Peter Zijlstra <peterz@infradead.org> wrote:

<snip the diagrams and discussion quoted from above>

> Playing around with it a little:
>
>                 EEVDF                           EVDF
>
> slice 30000000                   slice 30000000
> # Min Latencies: 00014           # Min Latencies: 00048
> # Avg Latencies: 00692           # Avg Latencies: 188239
> # Max Latencies: 94633           # Max Latencies: 961241
>
> slice 3000000                    slice 3000000
> # Min Latencies: 00054           # Min Latencies: 00055
> # Avg Latencies: 00522           # Avg Latencies: 00673
> # Max Latencies: 41475           # Max Latencies: 13297
>
> slice 300000                     slice 300000
> # Min Latencies: 00018           # Min Latencies: 00024
> # Avg Latencies: 00344           # Avg Latencies: 00056
> # Max Latencies: 20061           # Max Latencies: 00860

Thanks for sharing. Which workload was used to generate these numbers?

I think looking at the sched latency numbers alone does not show the
complete picture. I ran the same input latency test again and tried to
capture some of these numbers for the chrome processes.

EEVDF 1.5ms slice:
  Input latency test result: 226ms
  perf sched latency:
    switches: 1,084,694
    avg:   1.139 ms
    max: 408.397 ms

EEVDF 6.0ms slice:
  Input latency test result: 178ms
  perf sched latency:
    switches: 892,306
    avg:   1.145 ms
    max: 354.344 ms

EVDF 6.0ms slice:
  Input latency test result: 112ms
  perf sched latency:
    switches: 134,200
    avg:   2.610 ms
    max: 374.888 ms

EVDF 6.0ms slice (no run_to_parity, no place_lag, no place_deadline_initial):
  Input latency test result: 110ms
  perf sched latency:
    switches: 531,656
    avg:   0.830 ms
    max: 520.463 ms

For our scenario, it is very expensive to interrupt UI threads. It will
increase the input latency significantly. Lowering the scheduling
latency at the cost of switching out important threads can be very
detrimental in this workload. UI and input threads run with a nice value
of -8.

This also seems to match Daniel's message earlier in this thread where
using a 12ms base slice improved their benchmarks.

That said, this might not be beneficial for all workloads, and we are
still trying our other workloads out.

> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(
Sorry, I seem to have forgotten to reply to this part...

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> I think looking at the sched latency numbers alone does not show the
> complete picture. I ran the same input latency test again and tried to
> capture some of these numbers for the chrome processes.

<snip the per-slice input latency and perf sched numbers quoted from above>

> For our scenario, it is very expensive to interrupt UI threads. It
> will increase the input latency significantly. Lowering the scheduling
> latency at the cost of switching out important threads can be very
> detrimental in this workload. UI and input threads run with a nice
> value of -8.
>
> That said, this might not be beneficial for all workloads, and we are
> still trying our other workloads out.

Right, this seems to suggest something on your critical path (you should
trace that) has more than 3ms of compute in a single activation.
Basically this means chrome is fairly fat on this critical path. But it
seems you know about that.

Anyway, once you know the 95% length of the longest activation on your
critical path, you can indeed set your slice to that. This should be
readily available from trace data.
On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

<snip the EEVDF vs EVDF latency numbers quoted from above>

> Thanks for sharing. Which workload was used to generate these numbers?

This is hackbench vs cyclictest, where cyclictest gets a custom slice
set; the big slice is 10 * normal, the middle slice is normal (equal to
hackbench) and the short slice is normal / 10.
On Sun, Oct 08, 2023 at 12:04:00AM +0200, Peter Zijlstra wrote:

<snip the EEVDF vs EVDF diagrams and latency numbers quoted from above>

> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

And this can be easily explained by the fact that we insert tasks around
0-lag, so if we delay execution past this point we create an effective
DoS window.
On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
>
> Specifically, the program was used with the following argument for
> EEVDF:
>
> ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
>
> and with an additional '-n' for the EVDF column.
>
<snip a metric ton of diagrams>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be readily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
>
> That said, I do have me a patch to cure some of that:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
>
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
>
So, doing all that gave me an idea, see queue/sched/eevdf BIAS_ELIGIBLE.
It pushes the eligibility threshold (V) right by one average request.
The below patch is against eevdf.c.
I've not yet run it through the normal set of hackbench, netperf etc., so
it might eat your pet and set your granny on fire.
--- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
+++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
@@ -7,6 +7,7 @@
 #include <sys/param.h>
 
 bool eligible = true;
+bool bias = false;
 unsigned long V_lim = 20;
 
 struct entity {
@@ -79,16 +80,17 @@
 
 struct entity *pick_entity(int nr, struct entity *es)
 {
-	unsigned long W = 0, V = 0;
+	unsigned long W = 0, V = 0, R = 0;
 	struct entity *e = NULL;
 
 	for (int i = 0; i < nr; i++) {
 		V += es[i].weight * es[i].vruntime;
+		R += es[i].request;
 		W += es[i].weight;
 	}
 
 	for (int i = 0; i < nr; i++) {
-		if (eligible && W*es[i].vruntime > V)
+		if (eligible && W*es[i].vruntime > V + (bias * R))
 			continue;
 
 		if (!e || es[i].vdeadline < e->vdeadline)
@@ -169,10 +171,14 @@
 
 	const int N = sizeof(es) / sizeof(es[0]);
 
-	while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
+	while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
 		unsigned int v,w,r;
 
 		switch (opt) {
+		case 'b':
+			bias = true;
+			break;
+
 		case 'n':
 			eligible = false;
 			break;
On Thu, Oct 05, 2023 at 04:14:08PM +0200, Peter Zijlstra wrote:
> --- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
> +++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
> @@ -7,6 +7,7 @@
> #include <sys/param.h>
>
> bool eligible = true;
> +bool bias = false;
> unsigned long V_lim = 20;
>
> struct entity {
> @@ -79,16 +80,17 @@
>
> struct entity *pick_entity(int nr, struct entity *es)
> {
> - unsigned long W = 0, V = 0;
> + unsigned long W = 0, V = 0, R = 0;
> struct entity *e = NULL;
>
> for (int i = 0; i < nr; i++) {
> V += es[i].weight * es[i].vruntime;
> + R += es[i].request;
* 1024
Also, average seems too much, one large value lifts it too easily.
Need to come up with something better :/
> W += es[i].weight;
> }
>
> for (int i = 0; i < nr; i++) {
> - if (eligible && W*es[i].vruntime > V)
> + if (eligible && W*es[i].vruntime > V + (bias * R))
> continue;
>
> if (!e || es[i].vdeadline < e->vdeadline)
> @@ -169,10 +171,14 @@
>
> const int N = sizeof(es) / sizeof(es[0]);
>
> - while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
> + while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
> unsigned int v,w,r;
>
> switch (opt) {
> + case 'b':
> + bias = true;
> + break;
> +
> case 'n':
> eligible = false;
> break;
On Fri, Sep 29, 2023 at 11:54 AM Youssef Esmat
<youssefesmat@chromium.org> wrote:
>
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
>
> Hello Peter,
>
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.
>
> Some of my testing:
>
> All my testing was done on a two-core Celeron N400 CPU system at 1.1GHz,
> running the 6.5-rc3 kernel with the EEVDF changes ported.
>
> I have two CPU-bound tasks, one with a nice of -4 and the other with a
> nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
>
> With entity_eligible *enabled* and with entity_eligible *disabled*
> (always returns 1):
> Top showed consistent results, one at ~70% and the other at ~30%
>
> So it seems the deadline adjustment will naturally achieve fairness.
>
> I also added a few trace_printks to see if there is a case where
> entity_eligible would have returned 0 before the deadline forced us to
> reschedule. There were a few such cases. The following snippet of
> prints shows that an entity became ineligible 2 slices before its
> deadline expired. It seems like this will add more context switching
> but still achieve a similar result at the end.
>
> bprint: pick_eevdf: eligibility check: tid=4568,
> eligible=0, deadline=26577257249, vruntime=26575761118
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26575451399, vruntime=26574838855
> sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
> prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=1, deadline=26576270304, vruntime=26575657159
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26576270304, vruntime=26575657159
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577089170, vruntime=26576476006
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26577089170, vruntime=26576476006
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577908042, vruntime=26577294838
> bprint: pick_eevdf: found best deadline: tid=4568,
> deadline=26577257249, vruntime=26575761118
> sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
> prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
>
> In a more practical example, I tried this with one of our benchmarks
> that involves running Meet and Docs side by side and measuring the
> input latency in the Docs document. The following is the average
> latency for 5 runs:
>
> (These numbers are after removing our cgroup hierarchy - that might be
> a discussion for a later time).
>
> CFS: 168ms
> EEVDF with eligibility: 206ms (regression from CFS)
> EEVDF *without* eligibility: 143ms (improvement to CFS)
> EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
> 104ms (great improvement)
>
> Removing the eligibility check for this workload seemed to result in a
> great improvement. I haven't dug deeper but I suspect it's related to
> reduced context switches on our 2 core system.
> As an extra test I also increased the base_slice_ns and it further
> improved the input latency significantly.
>
> I would love to hear your thoughts. Thanks!
For completeness I ran two more tests:
1. EEVDF with eligibility and 6ms base_slice_ns.
2. EEVDF with eligibility with Benjamin Segall's patch
(https://lore.kernel.org/all/xm261qego72d.fsf_-_@google.com/).
I copied over all the previous results for easier comparison.
CFS: 168ms
EEVDF, eligib, 1.5ms slice: 206ms
EEVDF, eligib, 6ms slice: 167ms
EEVDF_Fix, eligib, 1.5ms slice: 190ms
EEVDF, *no*eligib, 1.5ms slice: 143ms
EEVDF, *no*eligib, 6ms slice: 104ms
It does seem like Benjamin's fix brought an improvement.