The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
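
For reference, a minimal user-space sketch of this scaling (an
illustration only, not kernel code; it assumes the SCHED_TUNABLESCALING_LOG
policy with the CPU count capped at 8, and the helper names are made up):

	#include <stdio.h>

	/* integer log2, mirroring what ilog2() computes for these values */
	static unsigned int ilog2_u(unsigned int x)
	{
		unsigned int l = 0;

		while (x >>= 1)
			l++;
		return l;
	}

	int main(void)
	{
		unsigned int base_slice_ns = 750000;	/* old default: 0.75 msec */
		unsigned int ncpus;

		for (ncpus = 1; ncpus <= 16; ncpus++) {
			/* the scaling factor stops growing at 8 CPUs */
			unsigned int capped = ncpus < 8 ? ncpus : 8;
			unsigned int factor = 1 + ilog2_u(capped);

			printf("%2u cpus -> slice = %.2f msec\n",
			       ncpus, base_slice_ns * factor / 1e6);
		}
		return 0;
	}

Running it prints the 0.75/1.50/2.25/3.00 msec progression listed above.
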
For HZ=250 and HZ=100, because of the coarse tick accuracy, the runtime
of tasks ends up far higher than their slice.

For HZ=1000 with 8 cpus or more, the tick accuracy is already
satisfactory, but there is still an issue: tasks will often get an extra
tick, because the tick tends to arrive a little earlier than expected. In
that case, the task has to wait until the next tick to be considered to
have reached its deadline, and so runs about 1ms longer.

vruntime + sysctl_sched_base_slice = deadline
           |-----------|-----------|-----------|-----------|
                1ms         1ms         1ms         1ms
                 ^           ^           ^           ^
               tick1       tick2       tick3       tick4 (nearly 4ms)

There are two sources of tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING, every tick accounts for slightly less than
1ms of task runtime, and even without it, because of clockevent
precision, a tick is still often worth less than 1ms.

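As an illustration of the extra-tick effect (assumed numbers only: each
1ms tick is taken to account for roughly 0.98ms of runtime), a slice of
3.00ms is only seen as exhausted on the 4th tick, while 2.80ms is already
exhausted on the 3rd:

	#include <stdio.h>

	/* count ticks until the accounted runtime reaches the slice */
	static unsigned int ticks_to_exhaust(double slice_ms, double per_tick_ms)
	{
		double accounted = 0.0;
		unsigned int ticks = 0;

		while (accounted < slice_ms) {
			accounted += per_tick_ms;
			ticks++;
		}
		return ticks;
	}

	int main(void)
	{
		double per_tick_ms = 0.98;	/* assumed: a tick accounts a bit less than 1ms */

		printf("slice 3.00ms -> %u ticks (~4ms of wall time)\n",
		       ticks_to_exhaust(3.00, per_tick_ms));
		printf("slice 2.80ms -> %u ticks (~3ms of wall time)\n",
		       ticks_to_exhaust(2.80, per_tick_ms));
		return 0;
	}
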
In order to make scheduling more precise, change the base slice from
0.75 to 0.70. Using 0.70 instead of 0.75 should not change much for
other configs and fixes this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.80 for 8 cpus and above.

This does not guarantee that tasks run their slice exactly every time,
but occasionally running one extra tick has little impact.

Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e78caa21436..34e7d09320f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
/*
* Minimal preemption granularity for CPU-bound tasks:
*
- * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
-unsigned int sysctl_sched_base_slice = 750000ULL;
-static unsigned int normalized_sysctl_sched_base_slice = 750000ULL;
+unsigned int sysctl_sched_base_slice = 700000ULL;
+static unsigned int normalized_sysctl_sched_base_slice = 700000ULL;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
--
2.33.0
Hello Zhou,

I'll leave some testing data below, but overall, in my testing with
CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
(at least not for any stable data point). There are a few small
regressions, probably as a result of a greater opportunity for wakeup
preemption since RUN_TO_PARITY will work for a slightly shorter duration
now, but I haven't dug deeper to confirm whether they are run-to-run
variation or a result of the larger number of wakeup preemptions.

Since most servers run with CONFIG_HZ=250, where the tick is 4ms anyway
and the default base slice is currently 3ms, I don't think there will be
any discernible difference for most workloads (fingers crossed).

Please find full data below.
On 2/8/2025 1:23 PM, zihan zhou wrote:
> The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
> means that we have a default slice of
> 0.75 for 1 cpu
> 1.50 up to 3 cpus
> 2.25 up to 7 cpus
> 3.00 for 8 cpus and above.
>
> For HZ=250 and HZ=100, because of the coarse tick accuracy, the runtime
> of tasks ends up far higher than their slice.
>
> For HZ=1000 with 8 cpus or more, the tick accuracy is already
> satisfactory, but there is still an issue: tasks will often get an extra
> tick, because the tick tends to arrive a little earlier than expected. In
> that case, the task has to wait until the next tick to be considered to
> have reached its deadline, and so runs about 1ms longer.
>
> vruntime + sysctl_sched_base_slice = deadline
>            |-----------|-----------|-----------|-----------|
>                 1ms         1ms         1ms         1ms
>                  ^           ^           ^           ^
>                tick1       tick2       tick3       tick4 (nearly 4ms)
>
> There are two sources of tick error: clockevent precision and
> CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
> CONFIG_IRQ_TIME_ACCOUNTING, every tick accounts for slightly less than
> 1ms of task runtime, and even without it, because of clockevent
> precision, a tick is still often worth less than 1ms.
>
> In order to make scheduling more precise, change the base slice from
> 0.75 to 0.70. Using 0.70 instead of 0.75 should not change much for
> other configs and fixes this issue:
> 0.70 for 1 cpu
> 1.40 up to 3 cpus
> 2.10 up to 7 cpus
> 2.80 for 8 cpus and above.
>
> This does not guarantee that tasks run their slice exactly every time,
> but occasionally running one extra tick has little impact.
>
> Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
o System Details
- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode
- Boost Enabled
- C2 disabled; POLL and MWAIT based C1 remained enabled
o Kernels
mainline:       For CONFIG_HZ=250 runs: mainline kernel at commit
                0de63bb7d919 ("Merge tag 'pull-fix' of
                git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs")
                For CONFIG_HZ=1000 runs: v6.14-rc2

new_base_slice: respective mainline + Patch 1
o Benchmark results (CONFIG_HZ=250)
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1-groups 1.00 [ -0.00]( 9.88) 1.09 [ -9.19](11.57)
2-groups 1.00 [ -0.00]( 3.49) 0.97 [ 2.91]( 4.51)
4-groups 1.00 [ -0.00]( 1.22) 0.99 [ 1.04]( 2.47)
8-groups 1.00 [ -0.00]( 0.80) 1.01 [ -1.10]( 1.81)
16-groups 1.00 [ -0.00]( 1.40) 1.01 [ -0.50]( 0.92)
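
A hedged reading of the table format (an assumption about the report, not
something stated in it): each cell is "mean normalized to mainline
[percent improvement over mainline] (coefficient of variation in %)",
with a positive "pct imp" meaning better regardless of whether the metric
is lower- or higher-is-better. A small sketch of that computation, with
made-up sample values:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		/* hypothetical raw samples for one data point, arbitrary units */
		double samples[] = { 10.1, 9.8, 10.3, 9.9 };
		double baseline_mean = 10.5;	/* assumed mainline mean */
		int lower_is_better = 1;	/* e.g. hackbench time */
		int i, n = sizeof(samples) / sizeof(samples[0]);
		double mean = 0.0, var = 0.0, normalized, pct_imp, cv;

		for (i = 0; i < n; i++)
			mean += samples[i];
		mean /= n;

		for (i = 0; i < n; i++)
			var += (samples[i] - mean) * (samples[i] - mean);
		var /= n;

		normalized = mean / baseline_mean;
		pct_imp = 100.0 * (baseline_mean - mean) / baseline_mean;
		cv = 100.0 * sqrt(var) / mean;

		if (!lower_is_better)
			pct_imp = -pct_imp;

		/* prints "0.95 [  4.52]( 1.92)" for these samples */
		printf("%.2f [%6.2f](%5.2f)\n", normalized, pct_imp, cv);
		return 0;
	}
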
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ 0.00]( 1.14) 0.99 [ -1.46]( 0.41)
2 1.00 [ 0.00]( 1.57) 1.01 [ 1.12]( 0.53)
4 1.00 [ 0.00]( 1.16) 0.99 [ -0.79]( 0.50)
8 1.00 [ 0.00]( 0.84) 0.98 [ -1.51]( 0.71)
16 1.00 [ 0.00]( 0.63) 0.97 [ -3.20]( 0.82)
32 1.00 [ 0.00]( 0.96) 0.99 [ -1.36]( 0.86)
64 1.00 [ 0.00]( 0.52) 0.97 [ -2.95]( 3.36)
128 1.00 [ 0.00]( 0.83) 0.99 [ -1.30]( 1.00)
256 1.00 [ 0.00]( 0.67) 1.00 [ -0.45]( 0.49)
512 1.00 [ 0.00]( 0.03) 1.00 [ -0.20]( 0.67)
1024 1.00 [ 0.00]( 0.19) 1.00 [ -0.14]( 0.24)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) new_base_slice[pct imp](CV)
Copy 1.00 [ 0.00](15.75) 1.21 [ 20.54]( 7.80)
Scale 1.00 [ 0.00]( 7.43) 1.00 [ 0.48]( 6.22)
Add 1.00 [ 0.00](10.35) 1.08 [ 7.98]( 6.29)
Triad 1.00 [ 0.00]( 9.34) 1.09 [ 9.09]( 6.91)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) new_base_slice[pct imp](CV)
Copy 1.00 [ 0.00]( 2.19) 1.05 [ 5.06]( 2.16)
Scale 1.00 [ 0.00]( 6.17) 1.02 [ 1.65]( 4.07)
Add 1.00 [ 0.00]( 5.88) 1.04 [ 3.81]( 1.07)
Triad 1.00 [ 0.00]( 1.40) 1.00 [ 0.06]( 3.79)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.14) 0.98 [ -1.66]( 0.64)
2-clients 1.00 [ 0.00]( 0.85) 0.98 [ -1.52]( 0.82)
4-clients 1.00 [ 0.00]( 0.77) 0.98 [ -1.72]( 0.77)
8-clients 1.00 [ 0.00]( 0.53) 0.98 [ -1.60]( 0.59)
16-clients 1.00 [ 0.00]( 0.91) 0.98 [ -1.79]( 0.74)
32-clients 1.00 [ 0.00]( 0.99) 0.99 [ -1.32]( 0.99)
64-clients 1.00 [ 0.00]( 1.35) 0.99 [ -1.43]( 1.39)
128-clients 1.00 [ 0.00]( 1.20) 0.99 [ -1.17]( 1.22)
256-clients 1.00 [ 0.00]( 4.41) 0.99 [ -1.07]( 4.95)
512-clients 1.00 [ 0.00](59.74) 1.00 [ -0.17](59.70)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00]( 7.39) 1.02 [ -2.38](35.97)
2 1.00 [ -0.00](10.14) 1.02 [ -2.22]( 7.22)
4 1.00 [ -0.00]( 3.53) 1.08 [ -8.33]( 3.27)
8 1.00 [ -0.00](11.48) 0.91 [ 8.93]( 4.97)
16 1.00 [ -0.00]( 7.02) 0.98 [ 1.72]( 6.22)
32 1.00 [ -0.00]( 3.79) 0.97 [ 3.23]( 2.53)
64 1.00 [ -0.00]( 8.22) 0.99 [ 0.57]( 2.31)
128 1.00 [ -0.00]( 4.38) 0.92 [ 8.25](87.57)
256 1.00 [ -0.00](19.81) 1.27 [-27.13](13.43)
512 1.00 [ -0.00]( 2.41) 1.00 [ -0.00]( 2.73)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 0.97 [ -2.64]( 0.68)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
4 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15)
8 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 0.42) 0.99 [ -0.92]( 3.95)
64 1.00 [ 0.00]( 2.45) 1.03 [ 3.09](15.04)
128 1.00 [ 0.00]( 0.20) 1.00 [ 0.00]( 0.00)
256 1.00 [ 0.00]( 0.84) 1.01 [ 0.92]( 0.54)
512 1.00 [ 0.00]( 0.97) 0.99 [ -0.72]( 0.75)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00](12.81) 0.91 [ 9.09](14.13)
2 1.00 [ -0.00]( 8.85) 1.00 [ -0.00]( 4.84)
4 1.00 [ -0.00](21.61) 0.86 [ 14.29]( 4.43)
8 1.00 [ -0.00]( 8.13) 0.91 [ 9.09](18.23)
16 1.00 [ -0.00]( 4.08) 1.00 [ -0.00]( 8.37)
32 1.00 [ -0.00]( 4.43) 1.00 [ -0.00](21.56)
64 1.00 [ -0.00]( 4.71) 1.00 [ -0.00](10.16)
128 1.00 [ -0.00]( 2.35) 0.93 [ 7.11]( 6.69)
256 1.00 [ -0.00]( 1.52) 1.02 [ -1.60]( 1.51)
512 1.00 [ -0.00]( 0.40) 1.01 [ -1.17]( 0.34)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00]( 2.46) 1.04 [ -3.67]( 0.35)
2 1.00 [ -0.00]( 3.16) 1.00 [ -0.26]( 0.13)
4 1.00 [ -0.00]( 3.16) 0.95 [ 4.60]( 2.82)
8 1.00 [ -0.00]( 1.00) 1.05 [ -4.81]( 0.00)
16 1.00 [ -0.00]( 3.77) 1.01 [ -0.80]( 2.44)
32 1.00 [ -0.00]( 1.94) 1.06 [ -6.24](27.22)
64 1.00 [ -0.00]( 1.07) 0.99 [ 1.29]( 0.68)
128 1.00 [ -0.00]( 0.44) 1.01 [ -0.62]( 0.32)
256 1.00 [ -0.00]( 7.02) 1.04 [ -4.45]( 7.53)
512 1.00 [ -0.00]( 1.10) 1.01 [ -1.02]( 2.59)
==================================================================
Test : longer running benchmarks
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmark pct imp
ycsb-cassandra -1.14%
ycsb-mongodb -0.84%
deathstarbench-1x -4.13%
deathstarbench-2x -3.93%
deathstarbench-3x -1.27%
deathstarbench-6x -0.10%
mysql-hammerdb-64VU -0.37%
---
o Benchmark results (CONFIG_HZ=1000)
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1-groups 1.00 [ -0.00]( 8.66) 1.05 [ -5.30](16.73)
2-groups 1.00 [ -0.00]( 5.02) 1.07 [ -6.54]( 7.29)
4-groups 1.00 [ -0.00]( 1.27) 1.02 [ -1.67]( 3.74)
8-groups 1.00 [ -0.00]( 2.75) 0.99 [ 0.78]( 2.61)
16-groups 1.00 [ -0.00]( 2.02) 0.97 [ 2.97]( 1.19)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ 0.00]( 0.40) 1.00 [ -0.44]( 0.47)
2 1.00 [ 0.00]( 0.49) 0.99 [ -0.65]( 1.39)
4 1.00 [ 0.00]( 0.94) 1.00 [ -0.34]( 0.09)
8 1.00 [ 0.00]( 0.64) 0.99 [ -0.77]( 1.57)
16 1.00 [ 0.00]( 1.04) 0.98 [ -2.00]( 0.98)
32 1.00 [ 0.00]( 1.13) 1.00 [ 0.34]( 1.31)
64 1.00 [ 0.00]( 0.58) 1.00 [ -0.28]( 0.80)
128 1.00 [ 0.00]( 1.40) 0.99 [ -0.91]( 0.51)
256 1.00 [ 0.00]( 1.14) 0.99 [ -1.48]( 1.17)
512 1.00 [ 0.00]( 0.51) 1.00 [ -0.25]( 0.66)
1024 1.00 [ 0.00]( 0.62) 0.99 [ -0.79]( 0.40)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) new_base_slice[pct imp](CV)
Copy 1.00 [ 0.00](16.03) 0.98 [ -2.33](17.69)
Scale 1.00 [ 0.00]( 6.26) 0.99 [ -0.60]( 7.94)
Add 1.00 [ 0.00]( 8.35) 1.01 [ 0.50](11.49)
Triad 1.00 [ 0.00]( 9.56) 1.01 [ 0.66]( 9.25)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) new_base_slice[pct imp](CV)
Copy 1.00 [ 0.00]( 6.03) 1.02 [ 1.58]( 2.27)
Scale 1.00 [ 0.00]( 5.78) 1.02 [ 1.64]( 4.50)
Add 1.00 [ 0.00]( 5.25) 1.01 [ 1.37]( 4.17)
Triad 1.00 [ 0.00]( 5.25) 1.03 [ 3.35]( 1.18)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.06) 1.01 [ 0.66]( 0.75)
2-clients 1.00 [ 0.00]( 0.80) 1.01 [ 0.79]( 0.31)
4-clients 1.00 [ 0.00]( 0.65) 1.01 [ 0.56]( 0.73)
8-clients 1.00 [ 0.00]( 0.82) 1.01 [ 0.70]( 0.59)
16-clients 1.00 [ 0.00]( 0.68) 1.01 [ 0.63]( 0.77)
32-clients 1.00 [ 0.00]( 0.95) 1.01 [ 0.87]( 1.06)
64-clients 1.00 [ 0.00]( 1.55) 1.01 [ 0.66]( 1.60)
128-clients 1.00 [ 0.00]( 1.23) 1.00 [ -0.28]( 1.58)
256-clients 1.00 [ 0.00]( 4.92) 1.00 [ 0.25]( 4.47)
512-clients 1.00 [ 0.00](57.12) 1.00 [ 0.24](62.52)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00](27.55) 0.81 [ 19.35](31.80)
2 1.00 [ -0.00](19.98) 0.87 [ 12.82]( 9.17)
4 1.00 [ -0.00](10.66) 1.09 [ -9.09]( 6.45)
8 1.00 [ -0.00]( 4.06) 0.90 [ 9.62]( 6.38)
16 1.00 [ -0.00]( 5.33) 0.98 [ 1.69]( 1.97)
32 1.00 [ -0.00]( 8.92) 0.97 [ 3.16]( 1.09)
64 1.00 [ -0.00]( 6.06) 0.97 [ 3.30]( 2.97)
128 1.00 [ -0.00](10.15) 1.05 [ -5.47]( 4.75)
256 1.00 [ -0.00](27.12) 1.00 [ -0.20](13.52)
512 1.00 [ -0.00]( 2.54) 0.80 [ 19.75]( 0.40)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.46)
2 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15)
4 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15)
8 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 0.43) 1.01 [ 0.63]( 0.28)
64 1.00 [ 0.00]( 1.17) 1.00 [ 0.00]( 0.20)
128 1.00 [ 0.00]( 0.20) 1.00 [ 0.00]( 0.20)
256 1.00 [ 0.00]( 0.27) 1.00 [ 0.00]( 1.69)
512 1.00 [ 0.00]( 0.21) 0.95 [ -4.70]( 0.34)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00](11.08) 1.33 [-33.33](15.78)
2 1.00 [ -0.00]( 4.08) 1.08 [ -7.69](10.00)
4 1.00 [ -0.00]( 6.39) 1.21 [-21.43](22.13)
8 1.00 [ -0.00]( 6.88) 1.15 [-15.38](11.93)
16 1.00 [ -0.00](13.62) 1.08 [ -7.69](10.33)
32 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 3.87)
64 1.00 [ -0.00]( 8.13) 1.00 [ -0.00]( 2.38)
128 1.00 [ -0.00]( 5.26) 0.98 [ 2.11]( 1.92)
256 1.00 [ -0.00]( 1.00) 0.78 [ 22.36](14.65)
512 1.00 [ -0.00]( 0.48) 0.73 [ 27.15]( 6.75)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) new_base_slice[pct imp](CV)
1 1.00 [ -0.00]( 1.53) 1.00 [ -0.00]( 1.77)
2 1.00 [ -0.00]( 0.50) 1.01 [ -1.35]( 1.19)
4 1.00 [ -0.00]( 0.14) 1.00 [ -0.00]( 0.42)
8 1.00 [ -0.00]( 0.24) 1.00 [ -0.27]( 1.37)
16 1.00 [ -0.00]( 0.00) 1.00 [ 0.27]( 0.14)
32 1.00 [ -0.00]( 0.66) 1.01 [ -1.48]( 2.65)
64 1.00 [ -0.00]( 5.72) 0.96 [ 4.32]( 5.64)
128 1.00 [ -0.00]( 0.10) 1.00 [ -0.20]( 0.18)
256 1.00 [ -0.00]( 2.52) 0.96 [ 4.04]( 9.70)
512 1.00 [ -0.00]( 0.68) 1.06 [ -5.52]( 0.36)
==================================================================
Test : longer running benchmarks
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmark pct imp
ycsb-cassandra -0.64%
ycsb-mongodb 0.56%
deathstarbench-1x 0.30%
deathstarbench-2x 3.21%
deathstarbench-3x 2.18%
deathstarbench-6x -0.40%
mysql-hammerdb-64VU -0.63%
---
If folks are interested in how CONFIG_HZ=250 vs CONFIG_HZ=1000 stack up,
here you go. (Note: there is slight variation between the two kernels,
since the CONFIG_HZ=250 version is closer to v6.14-rc1 and the
CONFIG_HZ=1000 results are based on v6.14-rc2.)
o Benchmark results (CONFIG_HZ=250 vs CONFIG_HZ=1000)
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1-groups 1.00 [ -0.00]( 9.88) 1.02 [ -1.57]( 8.66)
2-groups 1.00 [ -0.00]( 3.49) 0.95 [ 4.57]( 5.02)
4-groups 1.00 [ -0.00]( 1.22) 0.99 [ 0.62]( 1.27)
8-groups 1.00 [ -0.00]( 0.80) 1.00 [ -0.31]( 2.75)
16-groups 1.00 [ -0.00]( 1.40) 0.99 [ 1.17]( 2.02)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1 1.00 [ 0.00]( 1.14) 1.00 [ -0.45]( 0.40)
2 1.00 [ 0.00]( 1.57) 1.01 [ 1.40]( 0.49)
4 1.00 [ 0.00]( 1.16) 1.01 [ 1.16]( 0.94)
8 1.00 [ 0.00]( 0.84) 1.01 [ 1.24]( 0.64)
16 1.00 [ 0.00]( 0.63) 1.00 [ -0.33]( 1.04)
32 1.00 [ 0.00]( 0.96) 1.00 [ -0.30]( 1.13)
64 1.00 [ 0.00]( 0.52) 1.00 [ 0.27]( 0.58)
128 1.00 [ 0.00]( 0.83) 1.00 [ -0.45]( 1.40)
256 1.00 [ 0.00]( 0.67) 1.00 [ 0.15]( 1.14)
512 1.00 [ 0.00]( 0.03) 0.99 [ -0.73]( 0.51)
1024 1.00 [ 0.00]( 0.19) 0.99 [ -1.29]( 0.62)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
Copy 1.00 [ 0.00](15.75) 0.93 [ -6.67](16.03)
Scale 1.00 [ 0.00]( 7.43) 0.97 [ -2.70]( 6.26)
Add 1.00 [ 0.00](10.35) 0.94 [ -6.42]( 8.35)
Triad 1.00 [ 0.00]( 9.34) 0.92 [ -8.26]( 9.56)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
Copy 1.00 [ 0.00]( 2.19) 0.96 [ -3.52]( 6.03)
Scale 1.00 [ 0.00]( 6.17) 1.00 [ -0.22]( 5.78)
Add 1.00 [ 0.00]( 5.88) 0.99 [ -1.05]( 5.25)
Triad 1.00 [ 0.00]( 1.40) 0.96 [ -3.64]( 5.25)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.14) 0.99 [ -0.94]( 0.06)
2-clients 1.00 [ 0.00]( 0.85) 1.00 [ -0.43]( 0.80)
4-clients 1.00 [ 0.00]( 0.77) 0.99 [ -0.63]( 0.65)
8-clients 1.00 [ 0.00]( 0.53) 1.00 [ -0.49]( 0.82)
16-clients 1.00 [ 0.00]( 0.91) 0.99 [ -0.55]( 0.68)
32-clients 1.00 [ 0.00]( 0.99) 0.99 [ -1.01]( 0.95)
64-clients 1.00 [ 0.00]( 1.35) 0.98 [ -1.58]( 1.55)
128-clients 1.00 [ 0.00]( 1.20) 0.99 [ -1.38]( 1.23)
256-clients 1.00 [ 0.00]( 4.41) 0.99 [ -0.68]( 4.92)
512-clients 1.00 [ 0.00](59.74) 0.99 [ -1.16](57.12)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1 1.00 [ -0.00]( 7.39) 0.74 [ 26.19](27.55)
2 1.00 [ -0.00](10.14) 0.87 [ 13.33](19.98)
4 1.00 [ -0.00]( 3.53) 0.92 [ 8.33](10.66)
8 1.00 [ -0.00](11.48) 0.93 [ 7.14]( 4.06)
16 1.00 [ -0.00]( 7.02) 1.02 [ -1.72]( 5.33)
32 1.00 [ -0.00]( 3.79) 1.02 [ -2.15]( 8.92)
64 1.00 [ -0.00]( 8.22) 1.05 [ -4.60]( 6.06)
128 1.00 [ -0.00]( 4.38) 0.91 [ 9.48](10.15)
256 1.00 [ -0.00](19.81) 1.01 [ -0.60](27.12)
512 1.00 [ -0.00]( 2.41) 0.91 [ 9.45]( 2.54)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 0.99 [ -0.59]( 0.15)
2 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
4 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.15)
8 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.00)
32 1.00 [ 0.00]( 0.42) 0.98 [ -1.54]( 0.43)
64 1.00 [ 0.00]( 2.45) 1.03 [ 3.09]( 1.17)
128 1.00 [ 0.00]( 0.20) 0.98 [ -1.51]( 0.20)
256 1.00 [ 0.00]( 0.84) 1.02 [ 1.53]( 0.27)
512 1.00 [ 0.00]( 0.97) 1.02 [ 2.16]( 0.21)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1 1.00 [ -0.00](12.81) 1.09 [ -9.09](11.08)
2 1.00 [ -0.00]( 8.85) 1.18 [-18.18]( 4.08)
4 1.00 [ -0.00](21.61) 1.00 [ -0.00]( 6.39)
8 1.00 [ -0.00]( 8.13) 1.18 [-18.18]( 6.88)
16 1.00 [ -0.00]( 4.08) 1.00 [ -0.00](13.62)
32 1.00 [ -0.00]( 4.43) 1.08 [ -8.33]( 0.00)
64 1.00 [ -0.00]( 4.71) 1.16 [-15.79]( 8.13)
128 1.00 [ -0.00]( 2.35) 0.96 [ 3.55]( 5.26)
256 1.00 [ -0.00]( 1.52) 0.80 [ 19.58]( 1.00)
512 1.00 [ -0.00]( 0.40) 0.92 [ 8.09]( 0.48)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: mainline[pct imp](CV) mainline_1000HZ[pct imp](CV)
1 1.00 [ -0.00]( 2.46) 0.99 [ 0.52]( 1.53)
2 1.00 [ -0.00]( 3.16) 0.95 [ 5.11]( 0.50)
4 1.00 [ -0.00]( 3.16) 0.94 [ 5.62]( 0.14)
8 1.00 [ -0.00]( 1.00) 0.99 [ 0.80]( 0.24)
16 1.00 [ -0.00]( 3.77) 0.99 [ 0.53]( 0.00)
32 1.00 [ -0.00]( 1.94) 1.01 [ -1.00]( 0.66)
64 1.00 [ -0.00]( 1.07) 0.95 [ 5.38]( 5.72)
128 1.00 [ -0.00]( 0.44) 1.02 [ -1.65]( 0.10)
256 1.00 [ -0.00]( 7.02) 1.19 [-19.01]( 2.52)
512 1.00 [ -0.00]( 1.10) 0.89 [ 10.56]( 0.68)
==================================================================
Test : longer running benchmarks
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmark pct imp
ycsb-cassandra -1.25%
ycsb-mongodb -1.33%
deathstarbench-1x -2.27%
deathstarbench-2x -4.85%
deathstarbench-3x -0.25%
deathstarbench-6x -0.86%
mysql-hammerdb-64VU -1.78%
---
With that overwhelming amount of data out of the way, please feel free
to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e78caa21436..34e7d09320f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
> /*
> * Minimal preemption granularity for CPU-bound tasks:
> *
> - * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
> */
> -unsigned int sysctl_sched_base_slice = 750000ULL;
> -static unsigned int normalized_sysctl_sched_base_slice = 750000ULL;
> +unsigned int sysctl_sched_base_slice = 700000ULL;
> +static unsigned int normalized_sysctl_sched_base_slice = 700000ULL;
>
> const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
>
--
Thanks and Regards,
Prateek
Thank you for your reply, and thank you for providing such a detailed
test, which also let me learn a lot.

> Hello Zhou,
>
> I'll leave some testing data below, but overall, in my testing with
> CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
> (at least not for any stable data point). There are a few small
> regressions, probably as a result of a greater opportunity for wakeup
> preemption since RUN_TO_PARITY will work for a slightly shorter duration
> now, but I haven't dug deeper to confirm whether they are run-to-run
> variation or a result of the larger number of wakeup preemptions.
>
> Since most servers run with CONFIG_HZ=250, where the tick is 4ms anyway
> and the default base slice is currently 3ms, I don't think there will be
> any discernible difference for most workloads (fingers crossed).
>
> Please find full data below.

This should be CONFIG_HZ=250 and CONFIG_HZ=1000, is it wrong?

It seems that no performance difference is good news: this change will
not affect performance. This problem was first found in the openEuler
6.6 kernel. If one task runs all the time and the other runs for 3ms and
then sleeps for 1us, the running time of the two tasks becomes 4:3, but
is 1:1 on the original CFS. This problem has disappeared in the mainline
kernel.

> o Benchmark results (CONFIG_HZ=1000)
>
> [..benchmark tables snipped, quoted in full above..]

It seems that new_base_slice has made some progress under high
load/latency and regressed a bit under low load.

It seems that the slice should not only be related to the number of
CPUs, but also to the relationship between the overall load and the
number of CPUs: when the load is relatively heavy, the slice should be
smaller; when the load is relatively light, the slice should be larger.
Fixing it to a single value may not be the optimal solution.

> With that overwhelming amount of data out of the way, please feel free
> to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I think you're worth it, but it seems a bit late. I have received the
email from tip-bot2, and I am not sure if the tag can still be added.

Your email made me realize that I should establish a systematic testing
method. Can you give me some useful projects?

Thanks!
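
(A minimal user-space sketch of the 4:3 scenario mentioned above; the
pinning to one CPU, the 10-second duration and the helper names are
assumptions made for illustration, not part of the original report:)

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static double now_ms(clockid_t clk)
	{
		struct timespec ts;

		clock_gettime(clk, &ts);
		return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
	}

	static double wall_deadline;

	static void pin_to_cpu0(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);
		pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	}

	/* task A: runs all the time */
	static void *task_a(void *arg)
	{
		pin_to_cpu0();
		while (now_ms(CLOCK_MONOTONIC) < wall_deadline)
			;
		*(double *)arg = now_ms(CLOCK_THREAD_CPUTIME_ID);
		return NULL;
	}

	/* task B: runs ~3ms of CPU time, then sleeps for 1us */
	static void *task_b(void *arg)
	{
		pin_to_cpu0();
		while (now_ms(CLOCK_MONOTONIC) < wall_deadline) {
			double start = now_ms(CLOCK_THREAD_CPUTIME_ID);

			while (now_ms(CLOCK_THREAD_CPUTIME_ID) - start < 3.0)
				;
			usleep(1);
		}
		*(double *)arg = now_ms(CLOCK_THREAD_CPUTIME_ID);
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;
		double a_ms = 0.0, b_ms = 0.0;

		wall_deadline = now_ms(CLOCK_MONOTONIC) + 10000.0;
		pthread_create(&a, NULL, task_a, &a_ms);
		pthread_create(&b, NULL, task_b, &b_ms);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		/* on an affected kernel the split drifts towards ~4:3 */
		printf("A:B cpu time = %.0fms : %.0fms\n", a_ms, b_ms);
		return 0;
	}
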
Hello Zhou,

Sorry this slipped past me.

On 2/22/2025 8:32 AM, zihan zhou wrote:
> Thank you for your reply, and thank you for providing such a detailed
> test, which also let me learn a lot.
>
>> I'll leave some testing data below, but overall, in my testing with
>> CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
>> [..snip..]
>
> This should be CONFIG_HZ=250 and CONFIG_HZ=1000, is it wrong?

That is correct! My bad.

> It seems that no performance difference is good news: this change will
> not affect performance. This problem was first found in the openEuler
> 6.6 kernel. If one task runs all the time and the other runs for 3ms and
> then sleeps for 1us, the running time of the two tasks becomes 4:3, but
> is 1:1 on the original CFS. This problem has disappeared in the mainline
> kernel.
>
>> o Benchmark results (CONFIG_HZ=1000)
>>
>> [..benchmark tables snipped..]
>
> It seems that new_base_slice has made some progress under high
> load/latency and regressed a bit under low load.
>
> It seems that the slice should not only be related to the number of
> CPUs, but also to the relationship between the overall load and the
> number of CPUs: when the load is relatively heavy, the slice should be
> smaller; when the load is relatively light, the slice should be larger.
> Fixing it to a single value may not be the optimal solution.

We've seen such assumptions go wrong in our experiments; some benchmarks
really love their time on the CPU without any preemptions :)

>> With that overwhelming amount of data out of the way, please feel free
>> to add:
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> I think you're worth it, but it seems a bit late. I have received the
> email from tip-bot2, and I am not sure if the tag can still be added.

That is fine as long as there is a record on lore :)

> Your email made me realize that I should establish a systematic testing
> method. Can you give me some useful projects?

We use selective benchmarks from LKP: https://github.com/intel/lkp-tests

Then there are some larger benchmarks we run based on previous regression
reports and debugs. Some of them are:

YCSB: https://github.com/brianfrankcooper/YCSB
netperf: https://github.com/HewlettPackard/netperf
DeathStarBench: https://github.com/delimitrou/DeathStarBench
HammerDB: https://github.com/TPC-Council/HammerDB.git
tbench (part of dbench): https://dbench.samba.org/web/download.html
schbench: https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
sched-messaging: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/sched-messaging.c?h=v6.14-rc4

Some of them are hard to set up the first time; we internally have some
tools that make it easy to run these benchmarks in a way that stresses
the system, but we keep an eye out for regression reports to understand
what benchmarks folks are running in the field.

Sorry again for the delay, and thank you.

> Thanks!

--
Thanks and Regards,
Prateek
Thank you for your reply! I don't mind at all, and I'm also sorry for the
slow response due to too many things lately.

> Hello Zhou,
>
> Sorry this slipped past me.

Thank you very much for your guidance! I realize that without a good
benchmark, it is impossible to truly do a good job in scheduling. I will
try my best to make time to do this well.

> We use selective benchmarks from LKP: https://github.com/intel/lkp-tests
>
> [..benchmark list snipped..]
>
> Sorry again for the delay, and thank you.

Thank you for your support! Wishing you all the best.
On 02/08/25 15:53, zihan zhou wrote:
> The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
> means that we have a default slice of
> 0.75 for 1 cpu
> 1.50 up to 3 cpus
> 2.25 up to 7 cpus
> 3.00 for 8 cpus and above.

I brought up the topic of these magic values with Peter and Vincent at
LPC, as I think this logic is confusing. I have nothing against your
patch, but if the maintainers agree I am in favour of removing the
scaling completely, in favour of setting the slice to a single value that
is the same across all systems.

I do think 1ms makes more sense as a default value given how modern
workloads need faster responsiveness across the board. But keeping it at
3ms to avoid much disturbance would be fine. We could also make it equal
to TICK_MSEC (this define doesn't exist) if that is higher than 3ms.

Do you use HZ=100, by the way? If yes, are you able to share the reasons?
This configuration is too aggressive and bad for latencies, and I doubt
this tweak of the formula will make things much better for such systems
anyway.
On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:

> I brought up the topic of these magic values with Peter and Vincent at
> LPC, as I think this logic is confusing. I have nothing against your
> patch, but if the maintainers agree I am in favour of removing the
> scaling completely, in favour of setting the slice to a single value
> that is the same across all systems.

You're talking about the scaling, right?

Yeah, it is of limited use. The cap at 8, combined with the fact that
it's really hard to find a machine with fewer than 8 CPUs on it, makes
the whole thing mostly useless.

Back when we did this, we still had dual-core laptops. Now phones have 8
or more CPUs.

So I don't think I mind ripping it out.
On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
>
> > I brought up the topic of these magic values with Peter and Vincent at
> > LPC, as I think this logic is confusing. I have nothing against your
> > patch, but if the maintainers agree I am in favour of removing the
> > scaling completely, in favour of setting the slice to a single value
> > that is the same across all systems.
>
> You're talking about the scaling, right?
>
> Yeah, it is of limited use. The cap at 8, combined with the fact that
> it's really hard to find a machine with fewer than 8 CPUs on it, makes
> the whole thing mostly useless.
>
> Back when we did this, we still had dual-core laptops. Now phones have 8
> or more CPUs.
>
> So I don't think I mind ripping it out.

Besides the question of ripping it out or not: we still have a number of
devices with fewer than 8 cores, but they are not targeting phones,
laptops or servers ...
On 02/24/25 15:15, Vincent Guittot wrote:
> On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > [..snip..]
> >
> > So I don't think I mind ripping it out.
>
> Besides the question of ripping it out or not: we still have a number of
> devices with fewer than 8 cores, but they are not targeting phones,
> laptops or servers ...

I'm not sure if this is in favour of or against the rip-out, or
highlighting a new problem. But in case it is against the rip-out,
hopefully my answer in [1] highlights why the relationship to CPU count
is actually weak and not really helping much - I think it makes implicit
assumptions about the workloads, and I don't think those hold anymore.
Ignore me otherwise :-)

FWIW, a Raspberry Pi can be used as a server, a personal computer, a
multimedia entertainment system, a dumb sensor recorder/relayer or
anything else. I think most systems expect to run a variety of workloads,
and IMHO the fact that the system is overloaded and we need a reasonable
default base_slice to ensure timely progress of all running tasks has
little relation to NR_CPUS nowadays.

[1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/24/25 15:15, Vincent Guittot wrote:
> > [..snip..]
> >
> > Besides the question of ripping it out or not: we still have a number of
> > devices with fewer than 8 cores, but they are not targeting phones,
> > laptops or servers ...
>
> I'm not sure if this is in favour of or against the rip-out, or
> highlighting a new problem. But in case it is against the rip-out,
> hopefully my answer in [1]

My comment was only about the fact that assuming that systems now have 8
CPUs or more, so the scaling doesn't make any real difference in the end,
is not really true.

> highlights why the relationship to CPU count is actually weak and not
> really helping much - I think it makes implicit assumptions about the
> workloads, and I don't think those hold anymore. Ignore me otherwise :-)

Then, regarding the scaling factor, I don't have a strong opinion, but I
would not be so definitive about its uselessness, as there are a few
things to take into account:
- From a scheduling PoV, the scheduling delay is impacted by larger
  slices on devices with a small number of CPUs, even for lightly loaded
  cases
- 1000 HZ with a 1ms slice will generate 3 times more context switches
  than 2.8ms in a steadily loaded case, and if some people were concerned
  about using 1000 HZ by default, we will not feel better with a 1ms slice
- 1ms is not a good value. In fact, anything which is a multiple of the
  tick is not a good number, as the actual time accounted to the task is
  usually less than the tick
- And you can always set the scaling to none with tunable_scaling to get
  a fixed 0.7ms default slice whatever the number of CPUs

>
> FWIW, a Raspberry Pi can be used as a server, a personal computer, a
> multimedia entertainment system, a dumb sensor recorder/relayer or
> anything else. I think most systems expect to run a variety of workloads,
> and IMHO the fact that the system is overloaded and we need a reasonable
> default base_slice to ensure timely progress of all running tasks has
> little relation to NR_CPUS nowadays.
>
> [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
On Tue, 25 Feb 2025 at 02:29, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
> > [..snip..]
>
> Then, regarding the scaling factor, I don't have a strong opinion, but I
> would not be so definitive about its uselessness, as there are a few
> things to take into account:
> - From a scheduling PoV, the scheduling delay is impacted by larger
>   slices on devices with a small number of CPUs, even for lightly loaded
>   cases
> - 1000 HZ with a 1ms slice will generate 3 times more context switches
>   than 2.8ms in a steadily loaded case, and if some people were concerned
>   about using 1000 HZ by default, we will not feel better with a 1ms slice

Figures showing that there is no major regression from using a base slice
< 1ms everywhere would be a good starting point.

Some slight performance regression has just been reported for this patch,
which moves the base slice from 3ms down to 2.8ms [1].

[1] https://lore.kernel.org/lkml/202502251026.bb927780-lkp@intel.com/

> - 1ms is not a good value. In fact, anything which is a multiple of the
>   tick is not a good number, as the actual time accounted to the task is
>   usually less than the tick
> - And you can always set the scaling to none with tunable_scaling to get
>   a fixed 0.7ms default slice whatever the number of CPUs
>
> > FWIW, a Raspberry Pi can be used as a server, a personal computer, a
> > multimedia entertainment system, a dumb sensor recorder/relayer or
> > anything else. I think most systems expect to run a variety of
> > workloads, and IMHO the fact that the system is overloaded and we need
> > a reasonable default base_slice to ensure timely progress of all
> > running tasks has little relation to NR_CPUS nowadays.
> >
> > [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
On 02/25/25 11:13, Vincent Guittot wrote:
> On Tue, 25 Feb 2025 at 02:29, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 02/24/25 15:15, Vincent Guittot wrote:
> > > > On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> > > > >
> > > > > > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > > > > > I think this logic is confusing. I have nothing against your patch, but if the
> > > > > > maintainers agree I am in favour of removing it completely in favour of setting
> > > > > > it to a single value that is the same across all systems.
> > > > >
> > > > > You're talking about the scaling, right?
> > > > >
> > > > > Yeah, it is of limited use. The cap at 8, combined with the fact that
> > > > > its really hard to find a machine with less than 8 CPUs on, makes the
> > > > > whole thing mostly useless.
> > > > >
> > > > > Back when we did this, we still had dual-core laptops. Now phones have
> > > > > 8 or more CPUs on.
> > > > >
> > > > > So I don't think I mind ripping it out.
> > > >
> > > > Beside the question of ripping it out or not. We still have a number
> > > > of devices with less than 8 cores but they are not targeting phones,
> > > > laptops or servers ...
> > >
> > > I'm not sure if this is in favour or against the rip out, or highlighting a new
> > > problem. But in case it is against the rip-out, hopefully my answer in [1]
> >
> > My comment was only about the fact that assuming that systems now have
> > 8 cpus or more so scaling doesn't make any real diff at the end is not
> > really true.
> >
> > > highlights why the relationship to CPU number is actually weak and not really
> > > helping much - I think it is making implicit assumptions about the workloads and
> > > I don't think this holds anymore. Ignore me otherwise :-)
> >
> > Then regarding the scaling factor, I don't have a strong opinion but I
> > would not be so definitive about its uselessness as there are few
> > things to take into account:
> > - From a scheduling PoV, the scheduling delay is impacted by larger
> > slices on devices with small number of CPUs even for light loaded
> > cases
> > - 1000 HZ with 1ms slice will generate 3 times more context switch
> > than 2.8ms in a steady loaded case and if some people were concerned
> > about using 1000hz by default, we will not feel better with 1ms slice

Oh I was thinking of keeping the 3ms base_slice for all systems instead.
While I think 3ms is a bit too high, this is a more contentious topic and
needs more thinking/experimenting.

>
> Figures showing that there is no major regression to use a base slice
> < 1ms everywhere would be a good starting point.

I haven't tried less than 1ms. Worth experimenting with. Given our
fastest tick is 1ms, without HRTICK this will not be helpful. Except for
helping wakeup preemption.

I do strongly believe a shorter base slice (than 3ms) and HRTICK are the
right defaults. But this needs more data and evaluation. And fixing x86
(and similar archs) expensive HRTIMERs.

> Some slight performance regression has just been reported for this
> patch which moves base slice from 3ms down to 2.8ms [1].
>
> [1] https://lore.kernel.org/lkml/202502251026.bb927780-lkp@intel.com/

Oh I didn't realize this patch was already picked up.

Let me send a patch myself then, assuming we agree ripping the scaling
logic out and keeping base_slice a constant 3ms for all systems is fine.
This will undo this patch though..

I do want to encourage people to think more about their workloads and
their requirements. The kernel won't ever have a default that is optimum
across the board. They shouldn't be shy to tweak it via the task runtime
or debugfs instead. But the default should be representative for modern
systems/workloads still as the world moves on.

It would be great if we get feedback outside of these synthetic
benchmarks though. I really don't think they represent reality that much
(not saying they are completely useless).

> >
> > - 1ms is not a good value. In fact anything which is a multiple of the
> > tick is not a good number as the actual time accounted to the task is
> > usually less than the tick
> > - And you can always set the scaling to none with tunable_scaling to
> > get a fixed 0.7ms default slice whatever the number of CPUs
> >
> > >
> > > FWIW a raspberry PI can be used as a server, a personal computer, a multimedia
> > > entertainment system, a dumb sensor recorder/relayer or anything else. I think
> > > most systems expect to run a variety of workloads and IMHO the fact the system
> > > is overloaded and we need a reasonable default base_slice to ensure timely
> > > progress of all running tasks has little relation to NR_CPUs nowadays.
> > >
> > > [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
On 02/10/25 10:13, Peter Zijlstra wrote:
> On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
>
> > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > I think this logic is confusing. I have nothing against your patch, but if the
> > maintainers agree I am in favour of removing it completely in favour of setting
> > it to a single value that is the same across all systems.
>
> You're talking about the scaling, right?

Right.

>
> Yeah, it is of limited use. The cap at 8, combined with the fact that
> its really hard to find a machine with less than 8 CPUs on, makes the
> whole thing mostly useless.

Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
value depends on workload. NR_CPUs can make an overloaded case harder, but it
really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
experience the larger the machine the larger the workload, so the worst case
scenario of having to slice won't be in practice too much different. Especially
many programmers look at NR_CPUs and spawn as many threads..

Besides with EAS we force packing, so we artificially force contention to save
power.

Dynamically depending on rq->hr_nr_runnable looks attractive but I think this
is a recipe for more confusion. We sort of had this with sched_period, the new
fixed model is better IMHO.

>
> Back when we did this, we still had dual-core laptops. Now phones have
> 8 or more CPUs on.
>
> So I don't think I mind ripping it out.

Great!
> Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
> value depends on workload. NR_CPUs can make an overloaded case harder, but it
> really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
> experience the larger the machine the larger the workload, so the worst case
> scenario of having to slice won't be in practice too much different. Especially
> many programmers look at NR_CPUs and spawn as many threads..
>
> Besides with EAS we force packing, so we artificially force contention to save
> power.
>
> Dynamically depending on rq->hr_nr_runnable looks attractive but I think this
> is a recipe for more confusion. We sort of had this with sched_period, the new
> fixed model is better IMHO.

Hi, It seems that I have been thinking less about things. I have been
re-reading these emails recently. Can you give me the LPC links for these
discussions? I want to relearn this part seriously, such as why we don't
dynamically adjust the slice.

Thanks!
On 02/22/25 11:19, zihan zhou wrote:
> > Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
> > value depends on workload. NR_CPUs can make an overloaded case harder, but it
> > really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
> > experience the larger the machine the larger the workload, so the worst case
> > scenario of having to slice won't be in practice too much different. Especially
> > many programmers look at NR_CPUs and spawn as many threads..
> >
> > Besides with EAS we force packing, so we artificially force contention to save
> > power.
> >
> > Dynamically depending on rq->hr_nr_runnable looks attractive but I think this
> > is a recipe for more confusion. We sort of had this with sched_period, the new
> > fixed model is better IMHO.
>
> Hi, It seems that I have been thinking less about things. I have been
> re-reading these emails recently. Can you give me the LPC links for these
> discussions? I want to relearn this part seriously, such as why we don't
> dynamically adjust the slice.

No LPC talks. It was just something I noticed and brought up during LPC
offline and was planning to send a patch to that effect. The reasons above
are pretty much all of it. We are simply better off having a constant
base_slice. debugfs allows modifying it if users think they know better
and need to use another default. But the scaling factor doesn't hold great
(or any) value anymore and can create confusion for our users.
Thank you for your comments!

> I brought the topic up of these magic values with Peter and Vincent in LPC as
> I think this logic is confusing. I have nothing against your patch, but if the
> maintainers agree I am in favour of removing it completely in favour of setting
> it to a single value that is the same across all systems.

Here is my shallow understanding:
I think when the number of cpus is small, this type of machine is usually
a desktop. If the slice is still relatively large, a task has to have a
longer wake-up delay, which may result in a poorer user experience. When
there are a large number of cpus, it is likely to mean that the machine is
a server, its tasks are often batch workloads, and a slight increase in slice
is acceptable. And a server often has idle cpus or cpus with low load, so
even if there is a larger slice, the interactive experience is also not bad.

> I do think 1ms makes more sense as a default value given how modern workloads
> need faster responsiveness across the board. But keeping it 3ms to avoid much
> disturbance would be fine. We could also make it equal to TICK_MSEC (this
> define doesn't exist) if it is higher than 3ms.

I don't quite understand this. What is TICK_MSEC? If HZ=1000, then
TICK_MSEC=1ms? Why is it said that more than 3ms (slice) equals 1ms (tick)?

It seems that this value was originally designed for
sysctl_sched_wakeup_granularity. CFS does not force tasks to switch after
running for this time, but EEVDF does require it. So if the slice is too
small, like 1ms, it is not conducive to cache, and is not good for batch
workloads.

> Do you use HZ=100 by the way? If yes, are you able to share the reasons? This
> configuration is too aggressive and bad for latencies and I doubt this tweak of
> the formula will make things better to them anyway.

I don't use HZ=100; in fact, all the machines I use have HZ=1000 and more
than 8 cpus, so I'm not familiar with some scenarios.

I think that if the slice is smaller than the tick (10ms), there is not much
difference between a 3ms slice and a 1ms slice in tick preemption, but the two
are still different in wake-up preemption. After all, wake-up preemption
also goes through update_curr->update_deadline, and the wake-up latency
should be slightly lower with a 1ms slice. So I think, when HZ=100,
different slices still have an impact on latency.
On 02/10/25 14:18, zihan zhou wrote:
> Thank you for your comments!
>
> > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > I think this logic is confusing. I have nothing against your patch, but if the
> > maintainers agree I am in favour of removing it completely in favour of setting
> > it to a single value that is the same across all systems.
>
> Here is my shallow understanding:
> I think when the number of cpus is small, this type of machine is usually
> a desktop. If the slice is still relatively large, a task has to have a
> longer wake-up delay, which may result in a poorer user experience. When
> there are a large number of cpus, it is likely to mean that the machine is
> a server, its tasks are often batch workloads, and a slight increase in slice
> is acceptable. And a server often has idle cpus or cpus with low load, so
> even if there is a larger slice, the interactive experience is also not bad.

I think the logic has served its purpose and it's time to retire it. Any
larger than 8 CPUs will have the same mapping anyway. So let's simplify and
make it 3ms by default for everyone.

So the suggestion is to remove this logic and always set base_slice to 3ms
for all systems instead. No need to do the scaling anymore.

>
> > I do think 1ms makes more sense as a default value given how modern workloads
> > need faster responsiveness across the board. But keeping it 3ms to avoid much
> > disturbance would be fine. We could also make it equal to TICK_MSEC (this
> > define doesn't exist) if it is higher than 3ms.
>
> I don't quite understand this. What is TICK_MSEC? If HZ=1000, then
> TICK_MSEC=1ms? Why is it said that more than 3ms (slice) equals 1ms (tick)?

I meant base_slice = max(3ms, TICK_USEC * USEC_PER_MSEC). I was lazy to
type TICK_USEC * USEC_PER_MSEC and used TICK_MSEC instead. But this is a
bad idea. Please ignore it. With HZ=100 still selectable, doing that will
wreak havoc on wake up preemption on those systems.

>
> It seems that this value was originally designed for
> sysctl_sched_wakeup_granularity. CFS does not force tasks to switch after
> running for this time, but EEVDF does require it. So if the slice is too
> small, like 1ms, it is not conducive to cache, and is not good for batch
> workloads.
>
> > Do you use HZ=100 by the way? If yes, are you able to share the reasons? This
> > configuration is too aggressive and bad for latencies and I doubt this tweak of
> > the formula will make things better to them anyway.
>
> I don't use HZ=100; in fact, all the machines I use have HZ=1000 and more
> than 8 cpus, so I'm not familiar with some scenarios.
>
> I think that if the slice is smaller than the tick (10ms), there is not much
> difference between a 3ms slice and a 1ms slice in tick preemption, but the two
> are still different in wake-up preemption. After all, wake-up preemption
> also goes through update_curr->update_deadline, and the wake-up latency
> should be slightly lower with a 1ms slice. So I think, when HZ=100,
> different slices still have an impact on latency.

I am trying to argue elsewhere to remove HZ=100. Just was curious if you
actually use this value and if yes why. Sorry a bit of a tangent :)

Thanks!

--
Qais Yousef
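To see why the retracted max(3ms, one tick) idea above interacts badly with HZ=100, here is a small standalone sketch (illustrative only; tick_ns below is simply NSEC_PER_SEC / HZ as an approximation of the tick length, not the kernel's TICK_USEC/TICK_NSEC macros) evaluating what the default slice would become:

#include <stdio.h>

int main(void)
{
        const unsigned long long three_ms = 3000000ULL;        /* ns */
        unsigned int hz_values[] = { 1000, 250, 100 };

        for (int i = 0; i < 3; i++) {
                unsigned long long tick_ns = 1000000000ULL / hz_values[i];
                unsigned long long slice = tick_ns > three_ms ? tick_ns : three_ms;

                printf("HZ=%-4u -> base_slice would be %.1f ms\n",
                       hz_values[i], slice / 1e6);
        }
        /* prints 3.0ms for HZ=1000, 4.0ms for HZ=250, 10.0ms for HZ=100 */
        return 0;
}

At HZ=100 the default would balloon to a full 10ms, which is the wakeup-preemption damage referred to above.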
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2ae891b826958b60919ea21c727f77bcd6ffcc2c
Gitweb: https://git.kernel.org/tip/2ae891b826958b60919ea21c727f77bcd6ffcc2c
Author: zihan zhou <15645113830zzh@gmail.com>
AuthorDate: Sat, 08 Feb 2025 15:53:23 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Feb 2025 10:32:00 +01:00
sched: Reduce the default slice to avoid tasks getting an extra tick
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
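To make the scaling above concrete, here is a small standalone userspace sketch (not the kernel's implementation; the helper and variable names are illustrative) that reproduces the table for the old 0.75 msec base value, using the min(ncpus, 8) cap and the 1 + ilog2(ncpus) factor described above:

#include <stdio.h>

static unsigned int ilog2_u(unsigned int x)
{
        unsigned int r = 0;

        while (x >>= 1)
                r++;
        return r;
}

int main(void)
{
        const unsigned long long base_slice_ns = 750000ULL;    /* old default */
        unsigned int ncpus;

        for (ncpus = 1; ncpus <= 16; ncpus *= 2) {
                unsigned int capped = ncpus < 8 ? ncpus : 8;    /* cap at 8 */
                unsigned int factor = 1 + ilog2_u(capped);

                printf("%2u cpus -> %.2f msec\n",
                       ncpus, base_slice_ns * factor / 1e6);
        }
        return 0;       /* prints 0.75, 1.50, 2.25, 3.00, 3.00 */
}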
For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.
For HZ=1000 with 8 cpus or more, the accuracy of tick is already
satisfactory, but there is still an issue that tasks will get an extra
tick because the tick often arrives a little faster than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
    |-----------|-----------|-----------|-----------|
         1ms         1ms         1ms         1ms
               ^           ^           ^           ^
             tick1       tick2       tick3       tick4(nearly 4ms)
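A toy model of this effect (illustrative only, not kernel code; the 0.98ms tick is an assumed value standing in for "a tick slightly shorter than 1ms") shows how a 3ms slice picks up a fourth tick while a 2.8ms slice does not:

#include <stdio.h>

static unsigned int ticks_to_expire(double slice_ms, double tick_ms)
{
        double runtime = 0.0;
        unsigned int ticks = 0;

        while (runtime < slice_ms) {    /* deadline only noticed on a tick */
                runtime += tick_ms;
                ticks++;
        }
        return ticks;
}

int main(void)
{
        double tick_ms = 0.98;  /* assumed: each tick accounts a bit less than 1ms */

        printf("3.0ms slice -> %u ticks (~%.2fms actually run)\n",
               ticks_to_expire(3.0, tick_ms), ticks_to_expire(3.0, tick_ms) * tick_ms);
        printf("2.8ms slice -> %u ticks (~%.2fms actually run)\n",
               ticks_to_expire(2.8, tick_ms), ticks_to_expire(2.8, tick_ms) * tick_ms);
        return 0;       /* 4 ticks (~3.92ms) vs 3 ticks (~2.94ms) */
}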
There are two reasons for tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, the tick is still often
less than 1ms.
In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.80 for 8 cpus and above.
This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61b826f..1784752 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
 /*
  * Minimal preemption granularity for CPU-bound tasks:
  *
- * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_base_slice = 750000ULL;
-static unsigned int normalized_sysctl_sched_base_slice = 750000ULL;
+unsigned int sysctl_sched_base_slice = 700000ULL;
+static unsigned int normalized_sysctl_sched_base_slice = 700000ULL;
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;