[PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by zihan zhou 10 months, 2 weeks ago
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
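
For reference, this table comes from the tunable-scaling logic in
kernel/sched/fair.c; below is a lightly simplified sketch of
get_update_sysctl_factor(), which derives the multiplier applied to
normalized_sysctl_sched_base_slice:

static unsigned int get_update_sysctl_factor(void)
{
	/* Cap the CPU count at 8, then scale logarithmically by default. */
	unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8);

	switch (sysctl_sched_tunable_scaling) {
	case SCHED_TUNABLESCALING_NONE:
		return 1;
	case SCHED_TUNABLESCALING_LINEAR:
		return cpus;
	case SCHED_TUNABLESCALING_LOG:
	default:
		return 1 + ilog2(cpus);	/* base slice * (1 + ilog(ncpus)) */
	}
}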

For HZ=250 and HZ=100, because of the coarse tick granularity (4ms and
10ms per tick respectively), the runtime of tasks is far higher than
their slice.

For HZ=1000 with 8 cpus or more, the tick accuracy is already
satisfactory, but there is still an issue: tasks often get an extra tick
because the tick tends to arrive a little earlier than expected. In that
case, the task is only seen as having reached its deadline at the next
tick, and runs about 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)

There are two reasons for the tick error: clockevent precision, and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING, every tick accounts for less than 1ms of
task runtime, but even without it, because of clockevent precision, a
tick is still often less than 1ms.
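
For a CPU-bound task the deadline is effectively only observed at tick
granularity: each tick, update_curr() adds the elapsed runtime to
vruntime and then checks the deadline, roughly as in this lightly
simplified sketch of update_deadline() from kernel/sched/fair.c:

static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* Deadline not reached yet: keep running. */
	if ((s64)(se->vruntime - se->deadline) < 0)
		return false;

	/* EEVDF: vd_i = ve_i + r_i / w_i -- grant a new slice. */
	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

	/* The task has consumed its request: reschedule. */
	return true;
}

So if the Nth tick lands just short of the deadline, the check fails and
the task keeps the CPU for one more full tick.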

In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and fixes this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.80 for 8 cpus and above.

This does not guarantee that tasks always run exactly their slice, but
occasionally running an extra tick has little impact.
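
A worked example, assuming HZ=1000 on 8 or more cpus and ticks arriving
about 1% early (every ~0.99ms):

slice = 3.00ms: after 3 ticks ~2.97ms < 3.00ms, the deadline is not yet
                reached, so the task runs on to tick 4 (~3.96ms).
slice = 2.80ms: after 3 ticks ~2.97ms >= 2.80ms, the deadline is seen at
                tick 3 and the task is preempted after ~2.97ms.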

Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e78caa21436..34e7d09320f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
 /*
  * Minimal preemption granularity for CPU-bound tasks:
  *
- * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_base_slice			= 750000ULL;
-static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 700000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
-- 
2.33.0
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by K Prateek Nayak 10 months, 1 week ago
Hello Zhou,

I'll leave some testing data below but overall, in my testing with
CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
(at least not for any stable data point). There are a few small
regressions, probably as a result of greater opportunity for wakeup
preemption since RUN_TO_PARITY will work for a slightly shorter duration
now, but I haven't dug deeper to confirm whether they are run-to-run
variation or a result of the larger number of wakeup preemptions.
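
For context: RUN_TO_PARITY defers wakeup preemption until the current
task has used up the slice it was picked with, so a smaller base slice
shrinks that protected window. Roughly, simplified from pick_eevdf() in
mainline fair.c of this era (set_next_entity() stashes the deadline in
vlag at pick time):

	/*
	 * Once selected, run the task until it gets a new slice:
	 * vlag == deadline means the slice granted at pick has not
	 * been consumed yet.
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;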

Since most servers run with CONFIG_HZ=250, where the tick is 4ms anyway
and the default base slice is currently 3ms, I don't think there will
be any discernible difference in most workloads (fingers crossed)

Please find full data below.

On 2/8/2025 1:23 PM, zihan zhou wrote:
> The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
> means that we have a default slice of
> 0.75 for 1 cpu
> 1.50 up to 3 cpus
> 2.25 up to 7 cpus
> 3.00 for 8 cpus and above.
> 
> For HZ=250 and HZ=100, because of the coarse tick granularity (4ms and
> 10ms per tick respectively), the runtime of tasks is far higher than
> their slice.
> 
> For HZ=1000 with 8 cpus or more, the tick accuracy is already
> satisfactory, but there is still an issue: tasks often get an extra tick
> because the tick tends to arrive a little earlier than expected. In that
> case, the task is only seen as having reached its deadline at the next
> tick, and runs about 1ms longer.
> 
> vruntime + sysctl_sched_base_slice =     deadline
>          |-----------|-----------|-----------|-----------|
>               1ms          1ms         1ms         1ms
>                     ^           ^           ^           ^
>                   tick1       tick2       tick3       tick4(nearly 4ms)
> 
> There are two reasons for the tick error: clockevent precision, and
> CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
> CONFIG_IRQ_TIME_ACCOUNTING, every tick accounts for less than 1ms of
> task runtime, but even without it, because of clockevent precision, a
> tick is still often less than 1ms.
> 
> In order to make scheduling more precise, we changed 0.75 to 0.70.
> Using 0.70 instead of 0.75 should not change much for other configs
> and fixes this issue:
> 0.70 for 1 cpu
> 1.40 up to 3 cpus
> 2.10 up to 7 cpus
> 2.80 for 8 cpus and above.
> 
> This does not guarantee that tasks always run exactly their slice, but
> occasionally running an extra tick has little impact.
> 
> Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode
- Boost Enabled
- C2 disabled; POLL and MWAIT based C1 remained enabled

o Kernels

mainline:		For CONFIG_HZ=250 runs:
			mainline kernel at commit 0de63bb7d919 ("Merge
			tag 'pull-fix' of
			git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs")

			For CONFIG_HZ=1000 runs it was v6.14-rc2

new_base_slice:		respective mainline + Patch 1

o Benchmark results (CONFIG_HZ=250)

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  1-groups     1.00 [ -0.00]( 9.88)     1.09 [ -9.19](11.57)
  2-groups     1.00 [ -0.00]( 3.49)     0.97 [  2.91]( 4.51)
  4-groups     1.00 [ -0.00]( 1.22)     0.99 [  1.04]( 2.47)
  8-groups     1.00 [ -0.00]( 0.80)     1.01 [ -1.10]( 1.81)
16-groups     1.00 [ -0.00]( 1.40)     1.01 [ -0.50]( 0.92)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
     1     1.00 [  0.00]( 1.14)     0.99 [ -1.46]( 0.41)
     2     1.00 [  0.00]( 1.57)     1.01 [  1.12]( 0.53)
     4     1.00 [  0.00]( 1.16)     0.99 [ -0.79]( 0.50)
     8     1.00 [  0.00]( 0.84)     0.98 [ -1.51]( 0.71)
    16     1.00 [  0.00]( 0.63)     0.97 [ -3.20]( 0.82)
    32     1.00 [  0.00]( 0.96)     0.99 [ -1.36]( 0.86)
    64     1.00 [  0.00]( 0.52)     0.97 [ -2.95]( 3.36)
   128     1.00 [  0.00]( 0.83)     0.99 [ -1.30]( 1.00)
   256     1.00 [  0.00]( 0.67)     1.00 [ -0.45]( 0.49)
   512     1.00 [  0.00]( 0.03)     1.00 [ -0.20]( 0.67)
  1024     1.00 [  0.00]( 0.19)     1.00 [ -0.14]( 0.24)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  Copy     1.00 [  0.00](15.75)     1.21 [ 20.54]( 7.80)
Scale     1.00 [  0.00]( 7.43)     1.00 [  0.48]( 6.22)
   Add     1.00 [  0.00](10.35)     1.08 [  7.98]( 6.29)
Triad     1.00 [  0.00]( 9.34)     1.09 [  9.09]( 6.91)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  Copy     1.00 [  0.00]( 2.19)     1.05 [  5.06]( 2.16)
Scale     1.00 [  0.00]( 6.17)     1.02 [  1.65]( 4.07)
   Add     1.00 [  0.00]( 5.88)     1.04 [  3.81]( 1.07)
Triad     1.00 [  0.00]( 1.40)     1.00 [  0.06]( 3.79)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  1-clients     1.00 [  0.00]( 0.14)     0.98 [ -1.66]( 0.64)
  2-clients     1.00 [  0.00]( 0.85)     0.98 [ -1.52]( 0.82)
  4-clients     1.00 [  0.00]( 0.77)     0.98 [ -1.72]( 0.77)
  8-clients     1.00 [  0.00]( 0.53)     0.98 [ -1.60]( 0.59)
16-clients     1.00 [  0.00]( 0.91)     0.98 [ -1.79]( 0.74)
32-clients     1.00 [  0.00]( 0.99)     0.99 [ -1.32]( 0.99)
64-clients     1.00 [  0.00]( 1.35)     0.99 [ -1.43]( 1.39)
128-clients     1.00 [  0.00]( 1.20)     0.99 [ -1.17]( 1.22)
256-clients     1.00 [  0.00]( 4.41)     0.99 [ -1.07]( 4.95)
512-clients     1.00 [  0.00](59.74)     1.00 [ -0.17](59.70)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00]( 7.39)     1.02 [ -2.38](35.97)
   2     1.00 [ -0.00](10.14)     1.02 [ -2.22]( 7.22)
   4     1.00 [ -0.00]( 3.53)     1.08 [ -8.33]( 3.27)
   8     1.00 [ -0.00](11.48)     0.91 [  8.93]( 4.97)
  16     1.00 [ -0.00]( 7.02)     0.98 [  1.72]( 6.22)
  32     1.00 [ -0.00]( 3.79)     0.97 [  3.23]( 2.53)
  64     1.00 [ -0.00]( 8.22)     0.99 [  0.57]( 2.31)
128     1.00 [ -0.00]( 4.38)     0.92 [  8.25](87.57)
256     1.00 [ -0.00](19.81)     1.27 [-27.13](13.43)
512     1.00 [ -0.00]( 2.41)     1.00 [ -0.00]( 2.73)


==================================================================
Test          : new-schbench-requests-per-second
Units         : Normalized Requests per second
Interpretation: Higher is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [  0.00]( 0.00)     0.97 [ -2.64]( 0.68)
   2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
   4     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
   8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
  16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
  32     1.00 [  0.00]( 0.42)     0.99 [ -0.92]( 3.95)
  64     1.00 [  0.00]( 2.45)     1.03 [  3.09](15.04)
128     1.00 [  0.00]( 0.20)     1.00 [  0.00]( 0.00)
256     1.00 [  0.00]( 0.84)     1.01 [  0.92]( 0.54)
512     1.00 [  0.00]( 0.97)     0.99 [ -0.72]( 0.75)


==================================================================
Test          : new-schbench-wakeup-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00](12.81)     0.91 [  9.09](14.13)
   2     1.00 [ -0.00]( 8.85)     1.00 [ -0.00]( 4.84)
   4     1.00 [ -0.00](21.61)     0.86 [ 14.29]( 4.43)
   8     1.00 [ -0.00]( 8.13)     0.91 [  9.09](18.23)
  16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00]( 8.37)
  32     1.00 [ -0.00]( 4.43)     1.00 [ -0.00](21.56)
  64     1.00 [ -0.00]( 4.71)     1.00 [ -0.00](10.16)
128     1.00 [ -0.00]( 2.35)     0.93 [  7.11]( 6.69)
256     1.00 [ -0.00]( 1.52)     1.02 [ -1.60]( 1.51)
512     1.00 [ -0.00]( 0.40)     1.01 [ -1.17]( 0.34)


==================================================================
Test          : new-schbench-request-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00]( 2.46)     1.04 [ -3.67]( 0.35)
   2     1.00 [ -0.00]( 3.16)     1.00 [ -0.26]( 0.13)
   4     1.00 [ -0.00]( 3.16)     0.95 [  4.60]( 2.82)
   8     1.00 [ -0.00]( 1.00)     1.05 [ -4.81]( 0.00)
  16     1.00 [ -0.00]( 3.77)     1.01 [ -0.80]( 2.44)
  32     1.00 [ -0.00]( 1.94)     1.06 [ -6.24](27.22)
  64     1.00 [ -0.00]( 1.07)     0.99 [  1.29]( 0.68)
128     1.00 [ -0.00]( 0.44)     1.01 [ -0.62]( 0.32)
256     1.00 [ -0.00]( 7.02)     1.04 [ -4.45]( 7.53)
512     1.00 [ -0.00]( 1.10)     1.01 [ -1.02]( 2.59)


==================================================================
Test          : longer running benchmarks
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Median
==================================================================
Benchmark		pct imp
ycsb-cassandra          -1.14%
ycsb-mongodb            -0.84%
deathstarbench-1x       -4.13%
deathstarbench-2x       -3.93%
deathstarbench-3x       -1.27%
deathstarbench-6x       -0.10%
mysql-hammerdb-64VU     -0.37%
---

o Benchmark results (CONFIG_HZ=1000)

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  1-groups     1.00 [ -0.00]( 8.66)     1.05 [ -5.30](16.73)
  2-groups     1.00 [ -0.00]( 5.02)     1.07 [ -6.54]( 7.29)
  4-groups     1.00 [ -0.00]( 1.27)     1.02 [ -1.67]( 3.74)
  8-groups     1.00 [ -0.00]( 2.75)     0.99 [  0.78]( 2.61)
16-groups     1.00 [ -0.00]( 2.02)     0.97 [  2.97]( 1.19)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
     1     1.00 [  0.00]( 0.40)     1.00 [ -0.44]( 0.47)
     2     1.00 [  0.00]( 0.49)     0.99 [ -0.65]( 1.39)
     4     1.00 [  0.00]( 0.94)     1.00 [ -0.34]( 0.09)
     8     1.00 [  0.00]( 0.64)     0.99 [ -0.77]( 1.57)
    16     1.00 [  0.00]( 1.04)     0.98 [ -2.00]( 0.98)
    32     1.00 [  0.00]( 1.13)     1.00 [  0.34]( 1.31)
    64     1.00 [  0.00]( 0.58)     1.00 [ -0.28]( 0.80)
   128     1.00 [  0.00]( 1.40)     0.99 [ -0.91]( 0.51)
   256     1.00 [  0.00]( 1.14)     0.99 [ -1.48]( 1.17)
   512     1.00 [  0.00]( 0.51)     1.00 [ -0.25]( 0.66)
  1024     1.00 [  0.00]( 0.62)     0.99 [ -0.79]( 0.40)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  Copy     1.00 [  0.00](16.03)     0.98 [ -2.33](17.69)
Scale     1.00 [  0.00]( 6.26)     0.99 [ -0.60]( 7.94)
   Add     1.00 [  0.00]( 8.35)     1.01 [  0.50](11.49)
Triad     1.00 [  0.00]( 9.56)     1.01 [  0.66]( 9.25)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  Copy     1.00 [  0.00]( 6.03)     1.02 [  1.58]( 2.27)
Scale     1.00 [  0.00]( 5.78)     1.02 [  1.64]( 4.50)
   Add     1.00 [  0.00]( 5.25)     1.01 [  1.37]( 4.17)
Triad     1.00 [  0.00]( 5.25)     1.03 [  3.35]( 1.18)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
  1-clients     1.00 [  0.00]( 0.06)     1.01 [  0.66]( 0.75)
  2-clients     1.00 [  0.00]( 0.80)     1.01 [  0.79]( 0.31)
  4-clients     1.00 [  0.00]( 0.65)     1.01 [  0.56]( 0.73)
  8-clients     1.00 [  0.00]( 0.82)     1.01 [  0.70]( 0.59)
16-clients     1.00 [  0.00]( 0.68)     1.01 [  0.63]( 0.77)
32-clients     1.00 [  0.00]( 0.95)     1.01 [  0.87]( 1.06)
64-clients     1.00 [  0.00]( 1.55)     1.01 [  0.66]( 1.60)
128-clients     1.00 [  0.00]( 1.23)     1.00 [ -0.28]( 1.58)
256-clients     1.00 [  0.00]( 4.92)     1.00 [  0.25]( 4.47)
512-clients     1.00 [  0.00](57.12)     1.00 [  0.24](62.52)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00](27.55)     0.81 [ 19.35](31.80)
   2     1.00 [ -0.00](19.98)     0.87 [ 12.82]( 9.17)
   4     1.00 [ -0.00](10.66)     1.09 [ -9.09]( 6.45)
   8     1.00 [ -0.00]( 4.06)     0.90 [  9.62]( 6.38)
  16     1.00 [ -0.00]( 5.33)     0.98 [  1.69]( 1.97)
  32     1.00 [ -0.00]( 8.92)     0.97 [  3.16]( 1.09)
  64     1.00 [ -0.00]( 6.06)     0.97 [  3.30]( 2.97)
128     1.00 [ -0.00](10.15)     1.05 [ -5.47]( 4.75)
256     1.00 [ -0.00](27.12)     1.00 [ -0.20](13.52)
512     1.00 [ -0.00]( 2.54)     0.80 [ 19.75]( 0.40)


==================================================================
Test          : new-schbench-requests-per-second
Units         : Normalized Requests per second
Interpretation: Higher is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.46)
   2     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
   4     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
   8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
  16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
  32     1.00 [  0.00]( 0.43)     1.01 [  0.63]( 0.28)
  64     1.00 [  0.00]( 1.17)     1.00 [  0.00]( 0.20)
128     1.00 [  0.00]( 0.20)     1.00 [  0.00]( 0.20)
256     1.00 [  0.00]( 0.27)     1.00 [  0.00]( 1.69)
512     1.00 [  0.00]( 0.21)     0.95 [ -4.70]( 0.34)


==================================================================
Test          : new-schbench-wakeup-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00](11.08)     1.33 [-33.33](15.78)
   2     1.00 [ -0.00]( 4.08)     1.08 [ -7.69](10.00)
   4     1.00 [ -0.00]( 6.39)     1.21 [-21.43](22.13)
   8     1.00 [ -0.00]( 6.88)     1.15 [-15.38](11.93)
  16     1.00 [ -0.00](13.62)     1.08 [ -7.69](10.33)
  32     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 3.87)
  64     1.00 [ -0.00]( 8.13)     1.00 [ -0.00]( 2.38)
128     1.00 [ -0.00]( 5.26)     0.98 [  2.11]( 1.92)
256     1.00 [ -0.00]( 1.00)     0.78 [ 22.36](14.65)
512     1.00 [ -0.00]( 0.48)     0.73 [ 27.15]( 6.75)


==================================================================
Test          : new-schbench-request-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
   1     1.00 [ -0.00]( 1.53)     1.00 [ -0.00]( 1.77)
   2     1.00 [ -0.00]( 0.50)     1.01 [ -1.35]( 1.19)
   4     1.00 [ -0.00]( 0.14)     1.00 [ -0.00]( 0.42)
   8     1.00 [ -0.00]( 0.24)     1.00 [ -0.27]( 1.37)
  16     1.00 [ -0.00]( 0.00)     1.00 [  0.27]( 0.14)
  32     1.00 [ -0.00]( 0.66)     1.01 [ -1.48]( 2.65)
  64     1.00 [ -0.00]( 5.72)     0.96 [  4.32]( 5.64)
128     1.00 [ -0.00]( 0.10)     1.00 [ -0.20]( 0.18)
256     1.00 [ -0.00]( 2.52)     0.96 [  4.04]( 9.70)
512     1.00 [ -0.00]( 0.68)     1.06 [ -5.52]( 0.36)


==================================================================
Test          : longer running benchmarks
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Median
==================================================================
Benchmark		pct imp
ycsb-cassandra          -0.64%
ycsb-mongodb             0.56%
deathstarbench-1x        0.30%
deathstarbench-2x        3.21%
deathstarbench-3x        2.18%
deathstarbench-6x       -0.40%
mysql-hammerdb-64VU     -0.63%
---

If folks are interested in how CONFIG_HZ=250 vs CONFIG_HZ=1000 stack up,
here you go (note: there is a slight variation between the two kernels
since the CONFIG_HZ=250 version is closer to v6.14-rc1 and the
CONFIG_HZ=1000 results are based on v6.14-rc2)

o Benchmark results (CONFIG_HZ=250 vs CONFIG_HZ=1000)

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
  1-groups     1.00 [ -0.00]( 9.88)     1.02 [ -1.57]( 8.66)
  2-groups     1.00 [ -0.00]( 3.49)     0.95 [  4.57]( 5.02)
  4-groups     1.00 [ -0.00]( 1.22)     0.99 [  0.62]( 1.27)
  8-groups     1.00 [ -0.00]( 0.80)     1.00 [ -0.31]( 2.75)
16-groups     1.00 [ -0.00]( 1.40)     0.99 [  1.17]( 2.02)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
     1     1.00 [  0.00]( 1.14)     1.00 [ -0.45]( 0.40)
     2     1.00 [  0.00]( 1.57)     1.01 [  1.40]( 0.49)
     4     1.00 [  0.00]( 1.16)     1.01 [  1.16]( 0.94)
     8     1.00 [  0.00]( 0.84)     1.01 [  1.24]( 0.64)
    16     1.00 [  0.00]( 0.63)     1.00 [ -0.33]( 1.04)
    32     1.00 [  0.00]( 0.96)     1.00 [ -0.30]( 1.13)
    64     1.00 [  0.00]( 0.52)     1.00 [  0.27]( 0.58)
   128     1.00 [  0.00]( 0.83)     1.00 [ -0.45]( 1.40)
   256     1.00 [  0.00]( 0.67)     1.00 [  0.15]( 1.14)
   512     1.00 [  0.00]( 0.03)     0.99 [ -0.73]( 0.51)
  1024     1.00 [  0.00]( 0.19)     0.99 [ -1.29]( 0.62)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
  Copy     1.00 [  0.00](15.75)     0.93 [ -6.67](16.03)
Scale     1.00 [  0.00]( 7.43)     0.97 [ -2.70]( 6.26)
   Add     1.00 [  0.00](10.35)     0.94 [ -6.42]( 8.35)
Triad     1.00 [  0.00]( 9.34)     0.92 [ -8.26]( 9.56)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
  Copy     1.00 [  0.00]( 2.19)     0.96 [ -3.52]( 6.03)
Scale     1.00 [  0.00]( 6.17)     1.00 [ -0.22]( 5.78)
   Add     1.00 [  0.00]( 5.88)     0.99 [ -1.05]( 5.25)
Triad     1.00 [  0.00]( 1.40)     0.96 [ -3.64]( 5.25)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
  1-clients     1.00 [  0.00]( 0.14)     0.99 [ -0.94]( 0.06)
  2-clients     1.00 [  0.00]( 0.85)     1.00 [ -0.43]( 0.80)
  4-clients     1.00 [  0.00]( 0.77)     0.99 [ -0.63]( 0.65)
  8-clients     1.00 [  0.00]( 0.53)     1.00 [ -0.49]( 0.82)
16-clients     1.00 [  0.00]( 0.91)     0.99 [ -0.55]( 0.68)
32-clients     1.00 [  0.00]( 0.99)     0.99 [ -1.01]( 0.95)
64-clients     1.00 [  0.00]( 1.35)     0.98 [ -1.58]( 1.55)
128-clients     1.00 [  0.00]( 1.20)     0.99 [ -1.38]( 1.23)
256-clients     1.00 [  0.00]( 4.41)     0.99 [ -0.68]( 4.92)
512-clients     1.00 [  0.00](59.74)     0.99 [ -1.16](57.12)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
   1     1.00 [ -0.00]( 7.39)     0.74 [ 26.19](27.55)
   2     1.00 [ -0.00](10.14)     0.87 [ 13.33](19.98)
   4     1.00 [ -0.00]( 3.53)     0.92 [  8.33](10.66)
   8     1.00 [ -0.00](11.48)     0.93 [  7.14]( 4.06)
  16     1.00 [ -0.00]( 7.02)     1.02 [ -1.72]( 5.33)
  32     1.00 [ -0.00]( 3.79)     1.02 [ -2.15]( 8.92)
  64     1.00 [ -0.00]( 8.22)     1.05 [ -4.60]( 6.06)
128     1.00 [ -0.00]( 4.38)     0.91 [  9.48](10.15)
256     1.00 [ -0.00](19.81)     1.01 [ -0.60](27.12)
512     1.00 [ -0.00]( 2.41)     0.91 [  9.45]( 2.54)


==================================================================
Test          : new-schbench-requests-per-second
Units         : Normalized Requests per second
Interpretation: Higher is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
   1     1.00 [  0.00]( 0.00)     0.99 [ -0.59]( 0.15)
   2     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
   4     1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.15)
   8     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)
  16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)
  32     1.00 [  0.00]( 0.42)     0.98 [ -1.54]( 0.43)
  64     1.00 [  0.00]( 2.45)     1.03 [  3.09]( 1.17)
128     1.00 [  0.00]( 0.20)     0.98 [ -1.51]( 0.20)
256     1.00 [  0.00]( 0.84)     1.02 [  1.53]( 0.27)
512     1.00 [  0.00]( 0.97)     1.02 [  2.16]( 0.21)


==================================================================
Test          : new-schbench-wakeup-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
   1     1.00 [ -0.00](12.81)     1.09 [ -9.09](11.08)
   2     1.00 [ -0.00]( 8.85)     1.18 [-18.18]( 4.08)
   4     1.00 [ -0.00](21.61)     1.00 [ -0.00]( 6.39)
   8     1.00 [ -0.00]( 8.13)     1.18 [-18.18]( 6.88)
  16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](13.62)
  32     1.00 [ -0.00]( 4.43)     1.08 [ -8.33]( 0.00)
  64     1.00 [ -0.00]( 4.71)     1.16 [-15.79]( 8.13)
128     1.00 [ -0.00]( 2.35)     0.96 [  3.55]( 5.26)
256     1.00 [ -0.00]( 1.52)     0.80 [ 19.58]( 1.00)
512     1.00 [ -0.00]( 0.40)     0.92 [  8.09]( 0.48)


==================================================================
Test          : new-schbench-request-latency
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      mainline[pct imp](CV)    mainline_1000HZ[pct imp](CV)
   1     1.00 [ -0.00]( 2.46)     0.99 [  0.52]( 1.53)
   2     1.00 [ -0.00]( 3.16)     0.95 [  5.11]( 0.50)
   4     1.00 [ -0.00]( 3.16)     0.94 [  5.62]( 0.14)
   8     1.00 [ -0.00]( 1.00)     0.99 [  0.80]( 0.24)
  16     1.00 [ -0.00]( 3.77)     0.99 [  0.53]( 0.00)
  32     1.00 [ -0.00]( 1.94)     1.01 [ -1.00]( 0.66)
  64     1.00 [ -0.00]( 1.07)     0.95 [  5.38]( 5.72)
128     1.00 [ -0.00]( 0.44)     1.02 [ -1.65]( 0.10)
256     1.00 [ -0.00]( 7.02)     1.19 [-19.01]( 2.52)
512     1.00 [ -0.00]( 1.10)     0.89 [ 10.56]( 0.68)

==================================================================
Test          : longer running benchmarks
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Median
==================================================================
Benchmark		pct imp
ycsb-cassandra          -1.25%
ycsb-mongodb            -1.33%
deathstarbench-1x       -2.27%
deathstarbench-2x       -4.85%
deathstarbench-3x       -0.25%
deathstarbench-6x       -0.86%
mysql-hammerdb-64VU     -1.78%
---

With that overwhelming amount of data out of the way, please feel free
to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
>   kernel/sched/fair.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e78caa21436..34e7d09320f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
>   /*
>    * Minimal preemption granularity for CPU-bound tasks:
>    *
> - * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
>    */
> -unsigned int sysctl_sched_base_slice			= 750000ULL;
> -static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
> +unsigned int sysctl_sched_base_slice			= 700000ULL;
> +static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
>   
>   const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
>   

-- 
Thanks and Regards,
Prateek
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by zihan zhou 10 months ago
Thank you for your reply, and thank you for providing such a detailed
test; I learned a lot from it.

> Hello Zhou,
> 
> I'll leave some testing data below but overall, in my testing with
> CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
> (at least not for any stable data point). There are a few small
> regressions, probably as a result of greater opportunity for wakeup
> preemption since RUN_TO_PARITY will work for a slightly shorter duration
> now, but I haven't dug deeper to confirm whether they are run-to-run
> variation or a result of the larger number of wakeup preemptions.
> 
> Since most servers run with CONFIG_HZ=250, where the tick is 4ms anyway
> and the default base slice is currently 3ms, I don't think there will
> be any discernible difference in most workloads (fingers crossed)
> 
> Please find full data below.


This should be CONFIG_HZ=250 and CONFIG_HZ=1000; is CONFIG_HZ=10000 a typo?

It seems that no performance difference is good news: this change does
not affect performance. This problem was first found in the openEuler 6.6
kernel: if one task runs all the time and the other runs for 3ms and then
sleeps for 1us, the runtime ratio of the two tasks becomes 4:3, whereas
it is 1:1 on the original CFS. This problem has disappeared in the
mainline kernel.
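
A minimal userspace sketch of that scenario (my illustration, not the
original openEuler test case; build with gcc -pthread): pin both tasks
to one CPU, then compare their CPU time with top(1) or
/proc/<pid>/schedstat.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

static void pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pins calling thread */
}

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *spinner(void *arg)
{
	(void)arg;
	pin_to_cpu0();
	for (;;)
		;				/* pure CPU hog */
	return NULL;
}

static void *run_sleep(void *arg)
{
	(void)arg;
	pin_to_cpu0();
	for (;;) {
		long long start = now_ns();

		while (now_ns() - start < 3000000LL)
			;			/* burn ~3ms of CPU */
		usleep(1);			/* then sleep 1us */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, spinner, NULL);
	pthread_create(&b, NULL, run_sleep, NULL);
	pthread_join(a, NULL);			/* runs forever; watch with top */
	return 0;
}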

> o Benchmark results (CONFIG_HZ=1000)
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>   1-groups     1.00 [ -0.00]( 8.66)     1.05 [ -5.30](16.73)
>   2-groups     1.00 [ -0.00]( 5.02)     1.07 [ -6.54]( 7.29)
>   4-groups     1.00 [ -0.00]( 1.27)     1.02 [ -1.67]( 3.74)
>   8-groups     1.00 [ -0.00]( 2.75)     0.99 [  0.78]( 2.61)
> 16-groups     1.00 [ -0.00]( 2.02)     0.97 [  2.97]( 1.19)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>      1     1.00 [  0.00]( 0.40)     1.00 [ -0.44]( 0.47)
>      2     1.00 [  0.00]( 0.49)     0.99 [ -0.65]( 1.39)
>      4     1.00 [  0.00]( 0.94)     1.00 [ -0.34]( 0.09)
>      8     1.00 [  0.00]( 0.64)     0.99 [ -0.77]( 1.57)
>     16     1.00 [  0.00]( 1.04)     0.98 [ -2.00]( 0.98)
>     32     1.00 [  0.00]( 1.13)     1.00 [  0.34]( 1.31)
>     64     1.00 [  0.00]( 0.58)     1.00 [ -0.28]( 0.80)
>    128     1.00 [  0.00]( 1.40)     0.99 [ -0.91]( 0.51)
>    256     1.00 [  0.00]( 1.14)     0.99 [ -1.48]( 1.17)
>    512     1.00 [  0.00]( 0.51)     1.00 [ -0.25]( 0.66)
>   1024     1.00 [  0.00]( 0.62)     0.99 [ -0.79]( 0.40)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>   Copy     1.00 [  0.00](16.03)     0.98 [ -2.33](17.69)
> Scale     1.00 [  0.00]( 6.26)     0.99 [ -0.60]( 7.94)
>    Add     1.00 [  0.00]( 8.35)     1.01 [  0.50](11.49)
> Triad     1.00 [  0.00]( 9.56)     1.01 [  0.66]( 9.25)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>   Copy     1.00 [  0.00]( 6.03)     1.02 [  1.58]( 2.27)
> Scale     1.00 [  0.00]( 5.78)     1.02 [  1.64]( 4.50)
>    Add     1.00 [  0.00]( 5.25)     1.01 [  1.37]( 4.17)
> Triad     1.00 [  0.00]( 5.25)     1.03 [  3.35]( 1.18)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>   1-clients     1.00 [  0.00]( 0.06)     1.01 [  0.66]( 0.75)
>   2-clients     1.00 [  0.00]( 0.80)     1.01 [  0.79]( 0.31)
>   4-clients     1.00 [  0.00]( 0.65)     1.01 [  0.56]( 0.73)
>   8-clients     1.00 [  0.00]( 0.82)     1.01 [  0.70]( 0.59)
> 16-clients     1.00 [  0.00]( 0.68)     1.01 [  0.63]( 0.77)
> 32-clients     1.00 [  0.00]( 0.95)     1.01 [  0.87]( 1.06)
> 64-clients     1.00 [  0.00]( 1.55)     1.01 [  0.66]( 1.60)
> 128-clients     1.00 [  0.00]( 1.23)     1.00 [ -0.28]( 1.58)
> 256-clients     1.00 [  0.00]( 4.92)     1.00 [  0.25]( 4.47)
> 512-clients     1.00 [  0.00](57.12)     1.00 [  0.24](62.52)
> 
> 
> ==================================================================
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>    1     1.00 [ -0.00](27.55)     0.81 [ 19.35](31.80)
>    2     1.00 [ -0.00](19.98)     0.87 [ 12.82]( 9.17)
>    4     1.00 [ -0.00](10.66)     1.09 [ -9.09]( 6.45)
>    8     1.00 [ -0.00]( 4.06)     0.90 [  9.62]( 6.38)
>   16     1.00 [ -0.00]( 5.33)     0.98 [  1.69]( 1.97)
>   32     1.00 [ -0.00]( 8.92)     0.97 [  3.16]( 1.09)
>   64     1.00 [ -0.00]( 6.06)     0.97 [  3.30]( 2.97)
> 128     1.00 [ -0.00](10.15)     1.05 [ -5.47]( 4.75)
> 256     1.00 [ -0.00](27.12)     1.00 [ -0.20](13.52)
> 512     1.00 [ -0.00]( 2.54)     0.80 [ 19.75]( 0.40)
> 
> 
> ==================================================================
> Test          : new-schbench-requests-per-second
> Units         : Normalized Requests per second
> Interpretation: Higher is better
> Statistic     : Median
> ==================================================================
> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>    1     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.46)
>    2     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
>    4     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
>    8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
>   16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
>   32     1.00 [  0.00]( 0.43)     1.01 [  0.63]( 0.28)
>   64     1.00 [  0.00]( 1.17)     1.00 [  0.00]( 0.20)
> 128     1.00 [  0.00]( 0.20)     1.00 [  0.00]( 0.20)
> 256     1.00 [  0.00]( 0.27)     1.00 [  0.00]( 1.69)
> 512     1.00 [  0.00]( 0.21)     0.95 [ -4.70]( 0.34)
> 
> 
> ==================================================================
> Test          : new-schbench-wakeup-latency
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>    1     1.00 [ -0.00](11.08)     1.33 [-33.33](15.78)
>    2     1.00 [ -0.00]( 4.08)     1.08 [ -7.69](10.00)
>    4     1.00 [ -0.00]( 6.39)     1.21 [-21.43](22.13)
>    8     1.00 [ -0.00]( 6.88)     1.15 [-15.38](11.93)
>   16     1.00 [ -0.00](13.62)     1.08 [ -7.69](10.33)
>   32     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 3.87)
>   64     1.00 [ -0.00]( 8.13)     1.00 [ -0.00]( 2.38)
> 128     1.00 [ -0.00]( 5.26)     0.98 [  2.11]( 1.92)
> 256     1.00 [ -0.00]( 1.00)     0.78 [ 22.36](14.65)
> 512     1.00 [ -0.00]( 0.48)     0.73 [ 27.15]( 6.75)
> 
> 
> ==================================================================
> Test          : new-schbench-request-latency
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>    1     1.00 [ -0.00]( 1.53)     1.00 [ -0.00]( 1.77)
>    2     1.00 [ -0.00]( 0.50)     1.01 [ -1.35]( 1.19)
>    4     1.00 [ -0.00]( 0.14)     1.00 [ -0.00]( 0.42)
>    8     1.00 [ -0.00]( 0.24)     1.00 [ -0.27]( 1.37)
>   16     1.00 [ -0.00]( 0.00)     1.00 [  0.27]( 0.14)
>   32     1.00 [ -0.00]( 0.66)     1.01 [ -1.48]( 2.65)
>   64     1.00 [ -0.00]( 5.72)     0.96 [  4.32]( 5.64)
> 128     1.00 [ -0.00]( 0.10)     1.00 [ -0.20]( 0.18)
> 256     1.00 [ -0.00]( 2.52)     0.96 [  4.04]( 9.70)
> 512     1.00 [ -0.00]( 0.68)     1.06 [ -5.52]( 0.36)
> 
> 
> ==================================================================
> Test          : longer running benchmarks
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : Median
> ==================================================================
> Benchmark		pct imp
> ycsb-cassandra          -0.64%
> ycsb-mongodb             0.56%
> deathstarbench-1x        0.30%
> deathstarbench-2x        3.21%
> deathstarbench-3x        2.18%
> deathstarbench-6x       -0.40%
> mysql-hammerdb-64VU     -0.63%
> ---

It seems that new_base_slice has made some progress on latency under
high load and regressed a bit under low load.

It seems that the slice should not only be related to the number of
cpus, but also to the relationship between the overall load and the
number of cpus: when the load is relatively heavy, the slice should be
smaller; when the load is relatively light, the slice should be larger.
Fixing it to a single value may not be the optimal solution.
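
Something along these lines, purely a hypothetical sketch to illustrate
the idea (adaptive_base_slice() and its thresholds are invented here,
not proposed code):

static u64 adaptive_base_slice(struct rq *rq)
{
	u64 slice = sysctl_sched_base_slice;

	/*
	 * Hypothetical: shrink the slice as runnable tasks pile up on
	 * the CPU, but never below 0.5ms.
	 */
	if (rq->nr_running > 2)
		slice = max_t(u64, slice / rq->nr_running,
			      500 * NSEC_PER_USEC);

	return slice;
}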

> With that overwhelming amount of data out of the way, please feel free
> to add:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I think your Tested-by is well deserved, but it seems a bit late: I have
already received the tip-bot2 email, and I am not sure whether the tag
can still be added.

Your email made me realize that I should establish a systematic testing
method. Can you recommend some useful projects?

Thanks!
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by K Prateek Nayak 9 months, 2 weeks ago
Hello Zhou,

Sorry this slipped past me.

On 2/22/2025 8:32 AM, zihan zhou wrote:
> Thank you for your reply, and thank you for providing such a detailed
> test; I learned a lot from it.
> 
>> Hello Zhou,
>>
>> I'll leave some testing data below but overall, in my testing with
>> CONFIG_HZ=250 and CONFIG_HZ=10000, I cannot see any major regressions
>> (at least not for any stable data point). There are a few small
>> regressions, probably as a result of greater opportunity for wakeup
>> preemption since RUN_TO_PARITY will work for a slightly shorter duration
>> now, but I haven't dug deeper to confirm whether they are run-to-run
>> variation or a result of the larger number of wakeup preemptions.
>>
>> Since most servers run with CONFIG_HZ=250, where the tick is 4ms anyway
>> and the default base slice is currently 3ms, I don't think there will
>> be any discernible difference in most workloads (fingers crossed)
>>
>> Please find full data below.
> 
> 
> This should be CONFIG_HZ=250 and CONFIG_HZ=1000; is CONFIG_HZ=10000 a typo?

That is correct! My bad.

> 
> It seems that no performance difference is good news: this change does
> not affect performance. This problem was first found in the openEuler 6.6
> kernel: if one task runs all the time and the other runs for 3ms and then
> sleeps for 1us, the runtime ratio of the two tasks becomes 4:3, whereas
> it is 1:1 on the original CFS. This problem has disappeared in the
> mainline kernel.
> 
>> o Benchmark results (CONFIG_HZ=1000)
>>
>> ==================================================================
>> Test          : hackbench
>> Units         : Normalized time in seconds
>> Interpretation: Lower is better
>> Statistic     : AMean
>> ==================================================================
>> Case:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>    1-groups     1.00 [ -0.00]( 8.66)     1.05 [ -5.30](16.73)
>>    2-groups     1.00 [ -0.00]( 5.02)     1.07 [ -6.54]( 7.29)
>>    4-groups     1.00 [ -0.00]( 1.27)     1.02 [ -1.67]( 3.74)
>>    8-groups     1.00 [ -0.00]( 2.75)     0.99 [  0.78]( 2.61)
>> 16-groups     1.00 [ -0.00]( 2.02)     0.97 [  2.97]( 1.19)
>>
>>
>> ==================================================================
>> Test          : tbench
>> Units         : Normalized throughput
>> Interpretation: Higher is better
>> Statistic     : AMean
>> ==================================================================
>> Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>       1     1.00 [  0.00]( 0.40)     1.00 [ -0.44]( 0.47)
>>       2     1.00 [  0.00]( 0.49)     0.99 [ -0.65]( 1.39)
>>       4     1.00 [  0.00]( 0.94)     1.00 [ -0.34]( 0.09)
>>       8     1.00 [  0.00]( 0.64)     0.99 [ -0.77]( 1.57)
>>      16     1.00 [  0.00]( 1.04)     0.98 [ -2.00]( 0.98)
>>      32     1.00 [  0.00]( 1.13)     1.00 [  0.34]( 1.31)
>>      64     1.00 [  0.00]( 0.58)     1.00 [ -0.28]( 0.80)
>>     128     1.00 [  0.00]( 1.40)     0.99 [ -0.91]( 0.51)
>>     256     1.00 [  0.00]( 1.14)     0.99 [ -1.48]( 1.17)
>>     512     1.00 [  0.00]( 0.51)     1.00 [ -0.25]( 0.66)
>>    1024     1.00 [  0.00]( 0.62)     0.99 [ -0.79]( 0.40)
>>
>>
>> ==================================================================
>> Test          : stream-10
>> Units         : Normalized Bandwidth, MB/s
>> Interpretation: Higher is better
>> Statistic     : HMean
>> ==================================================================
>> Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>    Copy     1.00 [  0.00](16.03)     0.98 [ -2.33](17.69)
>> Scale     1.00 [  0.00]( 6.26)     0.99 [ -0.60]( 7.94)
>>     Add     1.00 [  0.00]( 8.35)     1.01 [  0.50](11.49)
>> Triad     1.00 [  0.00]( 9.56)     1.01 [  0.66]( 9.25)
>>
>>
>> ==================================================================
>> Test          : stream-100
>> Units         : Normalized Bandwidth, MB/s
>> Interpretation: Higher is better
>> Statistic     : HMean
>> ==================================================================
>> Test:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>    Copy     1.00 [  0.00]( 6.03)     1.02 [  1.58]( 2.27)
>> Scale     1.00 [  0.00]( 5.78)     1.02 [  1.64]( 4.50)
>>     Add     1.00 [  0.00]( 5.25)     1.01 [  1.37]( 4.17)
>> Triad     1.00 [  0.00]( 5.25)     1.03 [  3.35]( 1.18)
>>
>>
>> ==================================================================
>> Test          : netperf
>> Units         : Normalized Throughput
>> Interpretation: Higher is better
>> Statistic     : AMean
>> ==================================================================
>> Clients:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>    1-clients     1.00 [  0.00]( 0.06)     1.01 [  0.66]( 0.75)
>>    2-clients     1.00 [  0.00]( 0.80)     1.01 [  0.79]( 0.31)
>>    4-clients     1.00 [  0.00]( 0.65)     1.01 [  0.56]( 0.73)
>>    8-clients     1.00 [  0.00]( 0.82)     1.01 [  0.70]( 0.59)
>> 16-clients     1.00 [  0.00]( 0.68)     1.01 [  0.63]( 0.77)
>> 32-clients     1.00 [  0.00]( 0.95)     1.01 [  0.87]( 1.06)
>> 64-clients     1.00 [  0.00]( 1.55)     1.01 [  0.66]( 1.60)
>> 128-clients     1.00 [  0.00]( 1.23)     1.00 [ -0.28]( 1.58)
>> 256-clients     1.00 [  0.00]( 4.92)     1.00 [  0.25]( 4.47)
>> 512-clients     1.00 [  0.00](57.12)     1.00 [  0.24](62.52)
>>
>>
>> ==================================================================
>> Test          : schbench
>> Units         : Normalized 99th percentile latency in us
>> Interpretation: Lower is better
>> Statistic     : Median
>> ==================================================================
>> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>     1     1.00 [ -0.00](27.55)     0.81 [ 19.35](31.80)
>>     2     1.00 [ -0.00](19.98)     0.87 [ 12.82]( 9.17)
>>     4     1.00 [ -0.00](10.66)     1.09 [ -9.09]( 6.45)
>>     8     1.00 [ -0.00]( 4.06)     0.90 [  9.62]( 6.38)
>>    16     1.00 [ -0.00]( 5.33)     0.98 [  1.69]( 1.97)
>>    32     1.00 [ -0.00]( 8.92)     0.97 [  3.16]( 1.09)
>>    64     1.00 [ -0.00]( 6.06)     0.97 [  3.30]( 2.97)
>> 128     1.00 [ -0.00](10.15)     1.05 [ -5.47]( 4.75)
>> 256     1.00 [ -0.00](27.12)     1.00 [ -0.20](13.52)
>> 512     1.00 [ -0.00]( 2.54)     0.80 [ 19.75]( 0.40)
>>
>>
>> ==================================================================
>> Test          : new-schbench-requests-per-second
>> Units         : Normalized Requests per second
>> Interpretation: Higher is better
>> Statistic     : Median
>> ==================================================================
>> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>     1     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.46)
>>     2     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
>>     4     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)
>>     8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
>>    16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
>>    32     1.00 [  0.00]( 0.43)     1.01 [  0.63]( 0.28)
>>    64     1.00 [  0.00]( 1.17)     1.00 [  0.00]( 0.20)
>> 128     1.00 [  0.00]( 0.20)     1.00 [  0.00]( 0.20)
>> 256     1.00 [  0.00]( 0.27)     1.00 [  0.00]( 1.69)
>> 512     1.00 [  0.00]( 0.21)     0.95 [ -4.70]( 0.34)
>>
>>
>> ==================================================================
>> Test          : new-schbench-wakeup-latency
>> Units         : Normalized 99th percentile latency in us
>> Interpretation: Lower is better
>> Statistic     : Median
>> ==================================================================
>> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>     1     1.00 [ -0.00](11.08)     1.33 [-33.33](15.78)
>>     2     1.00 [ -0.00]( 4.08)     1.08 [ -7.69](10.00)
>>     4     1.00 [ -0.00]( 6.39)     1.21 [-21.43](22.13)
>>     8     1.00 [ -0.00]( 6.88)     1.15 [-15.38](11.93)
>>    16     1.00 [ -0.00](13.62)     1.08 [ -7.69](10.33)
>>    32     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 3.87)
>>    64     1.00 [ -0.00]( 8.13)     1.00 [ -0.00]( 2.38)
>> 128     1.00 [ -0.00]( 5.26)     0.98 [  2.11]( 1.92)
>> 256     1.00 [ -0.00]( 1.00)     0.78 [ 22.36](14.65)
>> 512     1.00 [ -0.00]( 0.48)     0.73 [ 27.15]( 6.75)
>>
>>
>> ==================================================================
>> Test          : new-schbench-request-latency
>> Units         : Normalized 99th percentile latency in us
>> Interpretation: Lower is better
>> Statistic     : Median
>> ==================================================================
>> #workers:      mainline[pct imp](CV)    new_base_slice[pct imp](CV)
>>     1     1.00 [ -0.00]( 1.53)     1.00 [ -0.00]( 1.77)
>>     2     1.00 [ -0.00]( 0.50)     1.01 [ -1.35]( 1.19)
>>     4     1.00 [ -0.00]( 0.14)     1.00 [ -0.00]( 0.42)
>>     8     1.00 [ -0.00]( 0.24)     1.00 [ -0.27]( 1.37)
>>    16     1.00 [ -0.00]( 0.00)     1.00 [  0.27]( 0.14)
>>    32     1.00 [ -0.00]( 0.66)     1.01 [ -1.48]( 2.65)
>>    64     1.00 [ -0.00]( 5.72)     0.96 [  4.32]( 5.64)
>> 128     1.00 [ -0.00]( 0.10)     1.00 [ -0.20]( 0.18)
>> 256     1.00 [ -0.00]( 2.52)     0.96 [  4.04]( 9.70)
>> 512     1.00 [ -0.00]( 0.68)     1.06 [ -5.52]( 0.36)
>>
>>
>> ==================================================================
>> Test          : longer running benchmarks
>> Units         : Normalized throughput
>> Interpretation: Higher is better
>> Statistic     : Median
>> ==================================================================
>> Benchmark		pct imp
>> ycsb-cassandra          -0.64%
>> ycsb-mongodb             0.56%
>> deathstarbench-1x        0.30%
>> deathstarbench-2x        3.21%
>> deathstarbench-3x        2.18%
>> deathstarbench-6x       -0.40%
>> mysql-hammerdb-64VU     -0.63%
>> ---
> 
> It seems that new_base_slice has made some progress on latency under
> high load and regressed a bit under low load.
> 
> It seems that the slice should not only be related to the number of
> cpus, but also to the relationship between the overall load and the
> number of cpus: when the load is relatively heavy, the slice should be
> smaller; when the load is relatively light, the slice should be larger.
> Fixing it to a single value may not be the optimal solution.

We've seen such assumptions go wrong in our experiments; some benchmarks
really love their time on the CPU without any preemptions :)

> 
>> With that overwhelming amount of data out of the way, please feel free
>> to add:
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> I think your Tested-by is well deserved, but it seems a bit late: I have
> already received the tip-bot2 email, and I am not sure whether the tag
> can still be added.

That is fine as long as there is a record on lore :)

> 
> Your email made me realize that I should establish a systematic testing
> method. Can you recommend some useful projects?

We use selective benchmarks from LKP: https://github.com/intel/lkp-tests

Then there are some larger benchmarks we run based on previous regression
reports and debugging efforts. Some of them are:

YCSB: https://github.com/brianfrankcooper/YCSB
netperf: https://github.com/HewlettPackard/netperf
DeathStarBench: https://github.com/delimitrou/DeathStarBench
HammerDB: https://github.com/TPC-Council/HammerDB.git
tbench (part of dbench): https://dbench.samba.org/web/download.html
schbench: https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
sched-messaging: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/sched-messaging.c?h=v6.14-rc4

Some of them are hard to set up the first time; we internally have some
tools that make it easy to run these benchmarks in a way that stresses
the system, but we also keep an eye out for regression reports to
understand what benchmarks folks are running in the field.

Sorry again for the delay and thank you.

> 
> Thanks!

-- 
Thanks and Regards,
Prateek
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by zihan zhou 9 months, 1 week ago
Thank you for your reply! I don't mind at all, and I'm also sorry for my
slow response; too many things have come up lately.

> Hello Zhou,
> 
> Sorry this slipped past me.

Thank you very much for your guidance! I realize that without a good
benchmark, it is impossible to truly do a good job in scheduling. I will
try my best to make time to do this well.

> We use selective benchmarks from LKP: https://github.com/intel/lkp-tests
> 
> Then there are some larger benchmarks we run based on previous regression
> reports and debugging efforts. Some of them are:
> 
> YCSB: https://github.com/brianfrankcooper/YCSB
> netperf: https://github.com/HewlettPackard/netperf
> DeathStarBench: https://github.com/delimitrou/DeathStarBench
> HammerDB: https://github.com/TPC-Council/HammerDB.git
> tbench (part of dbench): https://dbench.samba.org/web/download.html
> schbench: https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
> sched-messaging: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/sched-messaging.c?h=v6.14-rc4
> 
> Some of them are hard to set up the first time; we internally have some
> tools that make it easy to run these benchmarks in a way that stresses
> the system, but we also keep an eye out for regression reports to
> understand what benchmarks folks are running in the field.
> 
> Sorry again for the delay and thank you.

Thank you for your support! Wishing you all the best.
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 10 months, 1 week ago
On 02/08/25 15:53, zihan zhou wrote:
> The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
> means that we have a default slice of
> 0.75 for 1 cpu
> 1.50 up to 3 cpus
> 2.25 up to 7 cpus
> 3.00 for 8 cpus and above.

I brought up the topic of these magic values with Peter and Vincent at
LPC, as I think this logic is confusing. I have nothing against your
patch, but if the maintainers agree, I am in favour of removing it
completely in favour of setting it to a single value that is the same
across all systems.

I do think 1ms makes more sense as a default value given how modern
workloads need faster responsiveness across the board. But keeping it at
3ms to avoid too much disturbance would be fine. We could also make it
equal to TICK_MSEC (this define doesn't exist) whenever that is higher
than 3ms.
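
In other words, something like this sketch of the idea (not proposed
code; TICK_NSEC is a real define from include/linux/jiffies.h, while
TICK_MSEC would still need to be added):

	/* One flat default, raised to a full tick on coarse-HZ systems. */
	sysctl_sched_base_slice = max(3UL * NSEC_PER_MSEC,
				      (unsigned long)TICK_NSEC);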

Do you use HZ=100, by the way? If yes, are you able to share the
reasons? This configuration is too aggressive and bad for latencies, and
I doubt this tweak of the formula will make things better for it anyway.
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Peter Zijlstra 10 months, 1 week ago
On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:

> I brought up the topic of these magic values with Peter and Vincent at
> LPC, as I think this logic is confusing. I have nothing against your
> patch, but if the maintainers agree, I am in favour of removing it
> completely in favour of setting it to a single value that is the same
> across all systems.

You're talking about the scaling, right?

Yeah, it is of limited use. The cap at 8, combined with the fact that
it's really hard to find a machine with fewer than 8 CPUs these days,
makes the whole thing mostly useless.

Back when we did this, we still had dual-core laptops. Now phones have
8 or more CPUs.

So I don't think I mind ripping it out.
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Vincent Guittot 9 months, 4 weeks ago
On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
>
> > I brought up the topic of these magic values with Peter and Vincent at
> > LPC, as I think this logic is confusing. I have nothing against your
> > patch, but if the maintainers agree, I am in favour of removing it
> > completely in favour of setting it to a single value that is the same
> > across all systems.
>
> You're talking about the scaling, right?
>
> Yeah, it is of limited use. The cap at 8, combined with the fact that
> its really hard to find a machine with less than 8 CPUs on, makes the
> whole thing mostly useless.
>
> Back when we did this, we still had dual-core laptops. Now phones have
> 8 or more CPUs on.
>
> So I don't think I mind ripping it out.

Besides the question of ripping it out or not, we still have a number
of devices with fewer than 8 cores, but they are not targeting phones,
laptops or servers ...
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 9 months, 3 weeks ago
On 02/24/25 15:15, Vincent Guittot wrote:
> On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> >
> > > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > > I think this logic is confusing. I have nothing against your patch, but if the
> > > maintainers agree I am in favour of removing it completely in favour of setting
> > > it to a single value that is the same across all systems.
> >
> > You're talking about the scaling, right?
> >
> > Yeah, it is of limited use. The cap at 8, combined with the fact that
> > its really hard to find a machine with less than 8 CPUs on, makes the
> > whole thing mostly useless.
> >
> > Back when we did this, we still had dual-core laptops. Now phones have
> > 8 or more CPUs on.
> >
> > So I don't think I mind ripping it out.
> 
> Beside the question of ripping it out or not. We still have a number
> of devices with less than 8 cores but they are not targeting phones,
> laptops or servers ...

I'm not sure if this is in favour of or against the rip-out, or highlighting a
new problem. But in case it is against the rip-out, hopefully my answer in [1]
highlights why the relationship to CPU number is actually weak and not really
helping much - I think it makes implicit assumptions about the workloads and
I don't think those hold anymore. Ignore me otherwise :-)

FWIW a Raspberry Pi can be used as a server, a personal computer, a multimedia
entertainment system, a dumb sensor recorder/relayer or anything else. I think
most systems are expected to run a variety of workloads, and IMHO the fact that
the system is overloaded and we need a reasonable default base_slice to ensure
timely progress of all running tasks has little relation to NR_CPUs nowadays.

[1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Vincent Guittot 9 months, 3 weeks ago
On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 02/24/25 15:15, Vincent Guittot wrote:
> > On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> > >
> > > > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > > > I think this logic is confusing. I have nothing against your patch, but if the
> > > > maintainers agree I am in favour of removing it completely in favour of setting
> > > > it to a single value that is the same across all systems.
> > >
> > > You're talking about the scaling, right?
> > >
> > > Yeah, it is of limited use. The cap at 8, combined with the fact that
> > > its really hard to find a machine with less than 8 CPUs on, makes the
> > > whole thing mostly useless.
> > >
> > > Back when we did this, we still had dual-core laptops. Now phones have
> > > 8 or more CPUs on.
> > >
> > > So I don't think I mind ripping it out.
> >
> > Beside the question of ripping it out or not. We still have a number
> > of devices with less than 8 cores but they are not targeting phones,
> > laptops or servers ...
>
> I'm not sure if this is in favour or against the rip out, or highlighting a new
> problem. But in case it is against the rip-out, hopefully my answer in [1]

My comment was only that the assumption that systems now have 8 CPUs or
more, so the scaling makes no real difference in the end, is not really
true.

> highlights why the relationship to CPU number is actually weak and not really
> helping much - I think it is making implicit assumptions about the workloads and
> I don't think this holds anymore. Ignore me otherwise :-)

Then regarding the scaling factor, I don't have a strong opinion, but I
would not be so definitive about its uselessness, as there are a few
things to take into account:
- From a scheduling PoV, the scheduling delay is impacted by larger
slices on devices with a small number of CPUs, even in lightly loaded
cases
- 1000 HZ with a 1ms slice will generate about 3 times more context
switches than a 2.8ms slice in a steadily loaded case, and if some people
were concerned about using 1000 HZ by default, they will not feel better
with a 1ms slice
- 1ms is not a good value. In fact anything that is a multiple of the
tick is not a good number, as the actual time accounted to the task is
usually less than the tick
- And you can always set the scaling to none with tunable_scaling to
get a fixed 0.7ms default slice whatever the number of CPUs (a sketch of
doing that from userspace follows this list)
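
(To make that last point concrete, a minimal sketch, assuming
CONFIG_SCHED_DEBUG and debugfs mounted at /sys/kernel/debug, where recent
kernels expose the knob as sched/tunable_scaling; older kernels had it as the
kernel.sched_tunable_scaling sysctl. Writing 0 selects
SCHED_TUNABLESCALING_NONE:)

#include <stdio.h>

int main(void)
{
	/* 0 == SCHED_TUNABLESCALING_NONE: drop the ilog2(ncpus) factor,
	 * leaving the normalized 0.7ms base slice on any machine */
	FILE *f = fopen("/sys/kernel/debug/sched/tunable_scaling", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("0\n", f);
	fclose(f);
	return 0;
}

(A plain shell redirect does the same, of course; the point is only that the
scaling is already opt-out today.)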

>
> FWIW a raspberry PI can be used as a server, a personal computer, a multimedia
> entertainment system, a dumb sensor recorder/relayer or anything else. I think
> most systems expect to run a variety of workloads and IMHO the fact the system
> is overloaded and we need a reasonable default base_slice to ensure timely
> progress of all running tasks has little relation to NR_CPUs nowadays.
>
> [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Vincent Guittot 9 months, 3 weeks ago
On Tue, 25 Feb 2025 at 02:29, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 02/24/25 15:15, Vincent Guittot wrote:
> > > On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> > > >
> > > > > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > > > > I think this logic is confusing. I have nothing against your patch, but if the
> > > > > maintainers agree I am in favour of removing it completely in favour of setting
> > > > > it to a single value that is the same across all systems.
> > > >
> > > > You're talking about the scaling, right?
> > > >
> > > > Yeah, it is of limited use. The cap at 8, combined with the fact that
> > > > its really hard to find a machine with less than 8 CPUs on, makes the
> > > > whole thing mostly useless.
> > > >
> > > > Back when we did this, we still had dual-core laptops. Now phones have
> > > > 8 or more CPUs on.
> > > >
> > > > So I don't think I mind ripping it out.
> > >
> > > Beside the question of ripping it out or not. We still have a number
> > > of devices with less than 8 cores but they are not targeting phones,
> > > laptops or servers ...
> >
> > I'm not sure if this is in favour or against the rip out, or highlighting a new
> > problem. But in case it is against the rip-out, hopefully my answer in [1]
>
> My comment was only about the fact that assuming that systems now have
> 8 cpus or more so scaling doesn't make any real diff at the end is not
> really true.
>
> > highlights why the relationship to CPU number is actually weak and not really
> > helping much - I think it is making implicit assumptions about the workloads and
> > I don't think this holds anymore. Ignore me otherwise :-)
>
> Then regarding the scaling factor, I don't have a strong opinion but I
> would not be so definitive about its uselessness as there are few
> things to take into account:
> - From a scheduling PoV, the scheduling delay is impacted by largeer
> slices on devices with small number of CPUs even for light loaded
> cases
> - 1000 HZ with 1ms slice will generate 3 times more context switch
> than 2.8ms in a steady loaded case and if some people were concerned
> but using 1000hz by default, we will not feel better with 1ms slice

Figures showing that there is no major regression when using a base slice
< 1ms everywhere would be a good starting point.
A slight performance regression has just been reported for this
patch, which moves the base slice from 3ms down to 2.8ms [1].

[1] https://lore.kernel.org/lkml/202502251026.bb927780-lkp@intel.com/


> - 1ms is not a good value. In fact anything which is a multiple of the
> tick is not a good number as the actual time accounted to the task is
> usually less than the tick
> - And you can always set the scaling to none with tunable_scaling to
> get a fixed 0.7ms default slice whatever the number of CPUs
>
> >
> > FWIW a raspberry PI can be used as a server, a personal computer, a multimedia
> > entertainment system, a dumb sensor recorder/relayer or anything else. I think
> > most systems expect to run a variety of workloads and IMHO the fact the system
> > is overloaded and we need a reasonable default base_slice to ensure timely
> > progress of all running tasks has little relation to NR_CPUs nowadays.
> >
> > [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 9 months, 3 weeks ago
On 02/25/25 11:13, Vincent Guittot wrote:
> On Tue, 25 Feb 2025 at 02:29, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 25 Feb 2025 at 01:25, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 02/24/25 15:15, Vincent Guittot wrote:
> > > > On Mon, 10 Feb 2025 at 10:13, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > >
> > > > > On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> > > > >
> > > > > > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > > > > > I think this logic is confusing. I have nothing against your patch, but if the
> > > > > > maintainers agree I am in favour of removing it completely in favour of setting
> > > > > > it to a single value that is the same across all systems.
> > > > >
> > > > > You're talking about the scaling, right?
> > > > >
> > > > > Yeah, it is of limited use. The cap at 8, combined with the fact that
> > > > > its really hard to find a machine with less than 8 CPUs on, makes the
> > > > > whole thing mostly useless.
> > > > >
> > > > > Back when we did this, we still had dual-core laptops. Now phones have
> > > > > 8 or more CPUs on.
> > > > >
> > > > > So I don't think I mind ripping it out.
> > > >
> > > > Beside the question of ripping it out or not. We still have a number
> > > > of devices with less than 8 cores but they are not targeting phones,
> > > > laptops or servers ...
> > >
> > > I'm not sure if this is in favour or against the rip out, or highlighting a new
> > > problem. But in case it is against the rip-out, hopefully my answer in [1]
> >
> > My comment was only about the fact that assuming that systems now have
> > 8 cpus or more so scaling doesn't make any real diff at the end is not
> > really true.
> >
> > > highlights why the relationship to CPU number is actually weak and not really
> > > helping much - I think it is making implicit assumptions about the workloads and
> > > I don't think this holds anymore. Ignore me otherwise :-)
> >
> > Then regarding the scaling factor, I don't have a strong opinion but I
> > would not be so definitive about its uselessness as there are few
> > things to take into account:
> > - From a scheduling PoV, the scheduling delay is impacted by largeer
> > slices on devices with small number of CPUs even for light loaded
> > cases
> > - 1000 HZ with 1ms slice will generate 3 times more context switch
> > than 2.8ms in a steady loaded case and if some people were concerned
> > but using 1000hz by default, we will not feel better with 1ms slice

Oh, I was thinking of keeping the 3ms base_slice for all systems instead.
While I think 3ms is a bit too high, this is a more contentious topic and
needs more thinking/experimenting.

> 
> Figures showing that there is no major regression to use a base slice
> < 1ms everywhere would be a good starting point.

I haven't tried less than 1ms; it is worth experimenting with. Given our
fastest tick is 1ms, without HRTICK this will not be helpful, except for
helping wakeup preemption. I do strongly believe a shorter base slice (than
3ms) and HRTICK are the right defaults, but this needs more data and
evaluation, and fixing the expensive hrtimers on x86 (and similar archs).

> Some slight performance regression has just been reported for this
> patch which moves base slice from 3ms down to 2.8ms [1].
> 
> [1] https://lore.kernel.org/lkml/202502251026.bb927780-lkp@intel.com/

Oh, I didn't realize this patch was already picked up. Let me send a patch
myself then, assuming we agree that ripping the scaling logic out and keeping
base_slice a constant 3ms for all systems is fine. This will undo this patch
though. I do want to encourage people to think more about their workloads and
their requirements. The kernel won't ever have a default that is optimal across
the board. They shouldn't be shy about tweaking it via the task runtime or
debugfs instead (see the sketch below). But the default should still be
representative of modern systems/workloads as the world moves on. It would be
great to get feedback outside of these synthetic benchmarks though. I really
don't think they represent reality that much (not saying they are completely
useless).
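
(For the "task runtime" route, a hedged sketch, assuming a kernel of v6.12 or
later where EEVDF takes sched_attr::sched_runtime as the slice suggestion for
fair tasks. There is no glibc wrapper for sched_setattr(), hence the raw
syscall; the 1ms value is only an example:)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched/types.h>	/* struct sched_attr */

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;			/* SCHED_OTHER, i.e. the fair class */
	attr.sched_runtime = 1000000ULL;	/* requested slice: 1ms, in ns */

	/* pid 0 == calling task, flags == 0 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}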

> 
> 
> > - 1ms is not a good value. In fact anything which is a multiple of the
> > tick is not a good number as the actual time accounted to the task is
> > usually less than the tick
> > - And you can always set the scaling to none with tunable_scaling to
> > get a fixed 0.7ms default slice whatever the number of CPUs
> >
> > >
> > > FWIW a raspberry PI can be used as a server, a personal computer, a multimedia
> > > entertainment system, a dumb sensor recorder/relayer or anything else. I think
> > > most systems expect to run a variety of workloads and IMHO the fact the system
> > > is overloaded and we need a reasonable default base_slice to ensure timely
> > > progress of all running tasks has little relation to NR_CPUs nowadays.
> > >
> > > [1] https://lore.kernel.org/all/20250210230500.53mybtyvzhdagot5@airbuntu/
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 10 months, 1 week ago
On 02/10/25 10:13, Peter Zijlstra wrote:
> On Mon, Feb 10, 2025 at 01:29:31AM +0000, Qais Yousef wrote:
> 
> > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > I think this logic is confusing. I have nothing against your patch, but if the
> > maintainers agree I am in favour of removing it completely in favour of setting
> > it to a single value that is the same across all systems.
> 
> You're talking about the scaling, right?

Right.

> 
> Yeah, it is of limited use. The cap at 8, combined with the fact that
> its really hard to find a machine with less than 8 CPUs on, makes the
> whole thing mostly useless.

Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
value depends on the workload. NR_CPUs can make an overloaded case harder, but
it really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
experience, the larger the machine, the larger the workload, so in practice the
worst-case scenario of having to slice won't be too different. Especially since
many programmers look at NR_CPUs and spawn as many threads.

Besides, with EAS we force packing, so we artificially force contention to save
power.

Dynamically adjusting the slice based on rq->h_nr_runnable looks attractive,
but I think this is a recipe for more confusion. We sort of had this with
sched_period; the new fixed model is better IMHO.

> 
> Back when we did this, we still had dual-core laptops. Now phones have
> 8 or more CPUs on.
> 
> So I don't think I mind ripping it out.

Great!
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by zihan zhou 10 months ago
> Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
> value depends on workload. NR_CPUs can make an overloaded case harder, but it
> really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
> experience the larger the machine the larger the workload, so the worst case
> scenario of having to slice won't be in practice too much different. Especially
> many programmers look at NR_CPUs and spawn as many threads..
> 
> Besides with EAS we force packing, so we artificially force contention to save
> power.
> 
> Dynamically depending on rq->hr_nr_runnable looks attractive but I think this
> is a recipe for more confusion. We sort of had this with sched_period, the new
> fixed model is better IMHO.

Hi, it seems I haven't thought enough about this. I have been re-reading
these emails recently. Can you give me the LPC links for these
discussions? I want to relearn this part seriously, such as why we don't
dynamically adjust the slice.

Thanks!
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 9 months, 4 weeks ago
On 02/22/25 11:19, zihan zhou wrote:
> > Yes. The minimum bar of modern hardware is higher now. And generally IMHO this
> > value depends on workload. NR_CPUs can make an overloaded case harder, but it
> > really wouldn't take much to saturate 8 CPUs compared to 2 CPUs. And from
> > experience the larger the machine the larger the workload, so the worst case
> > scenario of having to slice won't be in practice too much different. Especially
> > many programmers look at NR_CPUs and spawn as many threads..
> > 
> > Besides with EAS we force packing, so we artificially force contention to save
> > power.
> > 
> > Dynamically depending on rq->hr_nr_runnable looks attractive but I think this
> > is a recipe for more confusion. We sort of had this with sched_period, the new
> > fixed model is better IMHO.
> 
> Hi, It seems that I have been thinking less about things. I have been re
> reading these emails recently. Can you give me the LPC links for these
> discussions? I want to relearn this part seriously, such as why we don't
> dynamically adjust the slice.

No LPC talks. It was just something I noticed and brought up during LPC
offline, and I was planning to send a patch to that effect. The reasons above
are pretty much all of it. We are simply better off having a constant
base_slice. debugfs allows modifying it if users think they know better and
need another default. But the scaling factor doesn't hold great (or any) value
anymore and can create confusion for our users.
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by zihan zhou 10 months, 1 week ago
Thank you for your comments!

> I brought the topic up of these magic values with Peter and Vincent in LPC as
> I think this logic is confusing. I have nothing against your patch, but if the
> maintainers agree I am in favour of removing it completely in favour of setting
> it to a single value that is the same across all systems.

Here is my shallow understanding:
I think when the number of CPUs is small, the machine is usually a desktop.
If the slice is relatively large, a task has a longer wake-up delay, which
may result in a poorer user experience. When there are a large number of
CPUs, the machine is likely a server whose tasks are often batch workloads,
so a slight increase in slice is acceptable. And a server often has idle
CPUs or CPUs with low load, so even with a larger slice the interactive
experience is not bad.

> I do think 1ms makes more sense as a default value given how modern workloads
> need faster responsiveness across the board. But keeping it 3ms to avoid much
> disturbance would be fine. We could also make it equal to TICK_MSEC (this
> define doesn't exist) if it is higher than 3ms.

I don't quite understand this. What is TICK_MSEC? If HZ=1000, then
TICK_MSEC=1ms? And why would a slice higher than 3ms be made equal to the tick?

It seems that this value was originally designed for
sysctl_sched_wakeup_granularity. CFS did not force tasks to switch after
running for this time, but EEVDF does require it, so if the slice is too
small, like 1ms, that is not conducive to cache locality and is not good
for batch workloads.

> Do you use HZ=100 by the way? If yes, are you able to share the reasons? This
> configuration is too aggressive and bad for latencies and I doubt this tweak of
> the formula will make things better to them anyway.

I don't use HZ=100; in fact, all the machines I use have HZ=1000 and more
than 8 CPUs, so I'm not familiar with some scenarios.

I think that if the slice is smaller than the tick (10ms), there is not much
difference between a 3ms slice and a 1ms slice for tick preemption, but the
two still differ for wake-up preemption. After all, wake-up preemption also
goes through update_curr()->update_deadline(), and the wake-up latency should
be slightly lower with a 1ms slice. So I think that even with HZ=100,
different slices still have an impact on latency.
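
(A runnable userspace sketch of the mechanism referred to here, simplified
from update_deadline() in kernel/sched/fair.c; the nice-0 weight is assumed,
so the calc_delta_fair() scaling is omitted and the numbers are only
illustrative:)

#include <stdbool.h>
#include <stdio.h>

struct entity {
	long long vruntime;	/* ns */
	long long deadline;	/* ns */
	long long slice;	/* ns */
};

/* simplified update_deadline(): true when curr should be rescheduled */
static bool update_deadline(struct entity *se)
{
	if (se->vruntime - se->deadline < 0)
		return false;

	/* start the next request one slice past the current vruntime */
	se->deadline = se->vruntime + se->slice;
	return true;
}

int main(void)
{
	struct entity se = { .vruntime = 0, .slice = 1000000 };	/* 1ms */

	se.deadline = se.slice;
	/* pretend update_curr() fires every 0.5ms (e.g. on wakeups) */
	for (long long ran = 0; ran <= 3000000; ran += 500000) {
		se.vruntime = ran;
		printf("ran %.1fms -> %s\n", ran / 1e6,
		       update_deadline(&se) ? "resched" : "keep running");
	}
	return 0;
}

(With a 3ms slice the first "resched" only fires at 3ms of runtime; with 1ms
it fires every millisecond, which is the lower wake-up latency described
above.)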
Re: [PATCH V3 1/2] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by Qais Yousef 10 months, 1 week ago
On 02/10/25 14:18, zihan zhou wrote:
> Thank you for your comments!
> 
> > I brought the topic up of these magic values with Peter and Vincent in LPC as
> > I think this logic is confusing. I have nothing against your patch, but if the
> > maintainers agree I am in favour of removing it completely in favour of setting
> > it to a single value that is the same across all systems.
> 
> Here is my shallow understanding:
> I think when the number of cpus is small, this type of machine is usually
> a desktop. If the slice is still relatively large, a task has to have a
> longer wake-up delay, which may result in a poorer user experience. When
> there are a large number of cpus, it is likely to mean that the machine is
> a server, its tasks often are batch workloads, a slight increase in slice
> is acceptable. And a server often has idle cpus or cpus with low load, So
> even if there are larger slice, the interaction experience is also not bad.

I think the logic has served its purpose and it's time to retire it. Anything
larger than 8 CPUs gets the same mapping anyway. So let's simplify and make it
3ms by default for everyone.

So the suggestion is to remove this logic and always set base_slice to 3ms for
all systems. No need to do the scaling anymore.

> 
> > I do think 1ms makes more sense as a default value given how modern workloads
> > need faster responsiveness across the board. But keeping it 3ms to avoid much
> > disturbance would be fine. We could also make it equal to TICK_MSEC (this
> > define doesn't exist) if it is higher than 3ms.
> 
> I don't quite understand this. What is TICK_MSEC? If HZ=1000, then
> TICK_MSEC=1ms? Why is it said that more than 3ms (slice) equals 1ms (tick)?

I meant

	base_slice = max(3ms, TICK_USEC * USEC_PER_MSEC)

I was too lazy to type TICK_USEC * USEC_PER_MSEC and used TICK_MSEC as
shorthand instead.
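
(In kernel units, sysctl_sched_base_slice is in nanoseconds, so a hedged
sketch of the above, using the existing TICK_NSEC macro from <linux/jiffies.h>
for the tick length in ns instead of a new TICK_MSEC define, would be:)

	base_slice = max(3ULL * NSEC_PER_MSEC, (u64)TICK_NSEC);

(With HZ=100 that would push the slice up to a whole 10ms tick, which leads
directly to the objection below.)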

But this is a bad idea. Please ignore it. With HZ=100 still selectable, doing
that will wreak havoc on wakeup preemption on those systems.

> 
> It seems that this value was originally designed for
> sysctl_sched_wakeup_granularity. CFS does not force tasks to switch after
> running for this time, but EEVDF does require it, So if slice is too small
> like 1ms, it looks not conducive to cache, and is not good for batch
> workloads.
> 
> > Do you use HZ=100 by the way? If yes, are you able to share the reasons? This
> > configuration is too aggressive and bad for latencies and I doubt this tweak of
> > the formula will make things better to them anyway.
> 
> I don't use HZ=100, in fact, all the machines I use have HZ=1000 and more
> than 8 cpus, so I'm not familiar with some scenarios.
> 
> I think that if the slice is smaller than tick (10ms), there is not much
> difference between 3ms slice and 1ms slice in tick preemption, but the two
> are still different in wake-up preemption. After all, when waking up
> preemption, there also has update_curr->update_deadline, and the wake-up
> latency should be slightly lower with 1ms slice. So I think, when HZ=100,
> different slices still have an impact on latency.

I am trying to argue elsewhere for removing HZ=100. I was just curious whether
you actually use this value and, if yes, why. Sorry, a bit of a tangent :)


Thanks!

--
Qais Yousef
[tip: sched/core] sched: Reduce the default slice to avoid tasks getting an extra tick
Posted by tip-bot2 for zihan zhou 10 months, 1 week ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2ae891b826958b60919ea21c727f77bcd6ffcc2c
Gitweb:        https://git.kernel.org/tip/2ae891b826958b60919ea21c727f77bcd6ffcc2c
Author:        zihan zhou <15645113830zzh@gmail.com>
AuthorDate:    Sat, 08 Feb 2025 15:53:23 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Feb 2025 10:32:00 +01:00

sched: Reduce the default slice to avoid tasks getting an extra tick

The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:

  0.75 for 1 cpu
  1.50 up to 3 cpus
  2.25 up to 7 cpus
  3.00 for 8 cpus and above.

For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.

For HZ=1000 with 8 cpus or more, the accuracy of the tick is already
satisfactory, but there is still an issue: tasks will get an extra
tick because the tick often arrives a little earlier than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)

There are two reasons for tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. With
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, the tick is still often
less than 1ms.

In order to make scheduling more precise, we changed 0.75 to 0.70.
Using 0.70 instead of 0.75 should not change much for other configs
and fixes this issue:

  0.70 for 1 cpu
  1.40 up to 3 cpus
  2.10 up to 7 cpus
  2.80 for 8 cpus and above.

This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.

Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61b826f..1784752 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -74,10 +74,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
 /*
  * Minimal preemption granularity for CPU-bound tasks:
  *
- * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_base_slice			= 750000ULL;
-static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 700000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;