[RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

Posted by K Prateek Nayak 2 years, 3 months ago
Even with the two patches, I still observe the following lock
contention when profiling the tbench 128-clients run with IBS:

  -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
     - 10.94% native_queued_spin_lock_slowpath
        - 10.73% _raw_spin_lock
           - 9.57% __schedule
                schedule_idle
                do_idle
              + cpu_startup_entry
           - 0.82% task_rq_lock
                newidle_balance
                pick_next_task_fair
                __schedule
                schedule_idle
                do_idle
              + cpu_startup_entry

Since David mentioned that the rq->avg_idle check is probably not the
right step towards the solution, this experiment introduces a per-shard
"overload" flag. Similar to "rq->rd->overload", the per-shard overload
flag signals the possibility that one or more rq covered by the shard's
domain have a queued task. The shard's overload flag is set at the same
time as "rq->rd->overload", and is cleared when the shard's list is
found to be empty.

With these changes, the following are the results for tbench 128-clients:

tip				: 1.00 (var: 1.00%)
tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 13 +++++++++++--
 kernel/sched/sched.h | 17 +++++++++++++++++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 446ffdad49e1..31fe109fdaf0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,7 @@ static void shared_runq_reassign_domains(void)
 		rq->cfs.shared_runq = shared_runq;
 		rq->cfs.shard = &shared_runq->shards[shard_idx];
 		rq_unlock(rq, &rf);
+		WRITE_ONCE(rq->cfs.shard->overload, 0);
 	}
 }
 
@@ -202,6 +203,7 @@ static void __shared_runq_drain(struct shared_runq *shared_runq)
 		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
 			list_del_init(&p->shared_runq_node);
 		raw_spin_unlock(&shard->lock);
+		WRITE_ONCE(shard->overload, 0);
 	}
 }
 
@@ -258,13 +260,20 @@ shared_runq_pop_task(struct shared_runq_shard *shard, int target)
 {
 	struct task_struct *p;
 
-	if (list_empty(&shard->list))
+	if (!READ_ONCE(shard->overload))
 		return NULL;
 
+	if (list_empty(&shard->list)) {
+		WRITE_ONCE(shard->overload, 0);
+		return NULL;
+	}
+
 	raw_spin_lock(&shard->lock);
 	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, target))
+	if (!p)
+		WRITE_ONCE(shard->overload, 0);
+	else if (is_cpu_allowed(p, target))
 		list_del_init(&p->shared_runq_node);
 	else
 		p = NULL;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f50176f720b1..e8d4d948f742 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -601,6 +601,20 @@ do {									\
 struct shared_runq_shard {
 	struct list_head list;
 	raw_spinlock_t lock;
+	/*
+	 * shared_runq_shard can contain running tasks.
+	 * In such cases where all the tasks are running,
+	 * it is futile to attempt to pull tasks from the
+	 * list. Overload flag is used to indicate case
+	 * where one or more rq in the shard domain may
+	 * have a queued task. If the flag is 0, it is
+	 * very likely that all tasks in the shard are
+	 * running and cannot be migrated. This is not
+	 * guarded by the shard lock, and since it may
+	 * be updated often, it is placed into its own
+	 * cacheline.
+	 */
+	int overload ____cacheline_aligned;
 } ____cacheline_aligned;
 
 /* This would likely work better as a configurable knob via debugfs */
@@ -2585,6 +2599,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 		if (!READ_ONCE(rq->rd->overload))
 			WRITE_ONCE(rq->rd->overload, 1);
+
+		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
+			WRITE_ONCE(rq->cfs.shard->overload, 1);
 	}
 #endif
 
-- 
2.34.1
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by David Vernet 2 years, 3 months ago
On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:

Hi Prateek,

> Even with the two patches, I still observe the following lock
> contention when profiling the tbench 128-clients run with IBS:
> 
>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>      - 10.94% native_queued_spin_lock_slowpath
>         - 10.73% _raw_spin_lock
>            - 9.57% __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
>            - 0.82% task_rq_lock
>                 newidle_balance
>                 pick_next_task_fair
>                 __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
> 
> Since David mentioned rq->avg_idle check is probably not the right step
> towards the solution, this experiment introduces a per-shard
> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> notifies of the possibility of one or more rq covered in the shard's
> domain having a queued task. shard's overload flag is set at the same
> time as "rq->rd->overload", and is cleared when shard's list is found
> to be empty.

I think this is an interesting idea, but I feel that it's still working
against the core proposition of SHARED_RUNQ, which is to enable work
conservation.

> With these changes, following are the results for tbench 128-clients:

Just to make sure I understand, this is to address the contention we're
observing on tbench with 64 - 256 clients, right?  That's my
understanding from Gautham's reply in [0].

[0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/

If so, are we sure this change won't regress other workloads that would
have benefited from the work conservation?

Also, I assume that you don't see the improved contention without this,
even if you include your fix to the newidle_balance() that has us skip
over the <= LLC domain?

Thanks,
David

P.S. Taking off on vacation now, so any replies will be very delayed.
Thanks again for working on this!

> 
> tip				: 1.00 (var: 1.00%)
> tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
> tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c  | 13 +++++++++++--
>  kernel/sched/sched.h | 17 +++++++++++++++++
>  2 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 446ffdad49e1..31fe109fdaf0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -186,6 +186,7 @@ static void shared_runq_reassign_domains(void)
>  		rq->cfs.shared_runq = shared_runq;
>  		rq->cfs.shard = &shared_runq->shards[shard_idx];
>  		rq_unlock(rq, &rf);
> +		WRITE_ONCE(rq->cfs.shard->overload, 0);
>  	}
>  }
>  
> @@ -202,6 +203,7 @@ static void __shared_runq_drain(struct shared_runq *shared_runq)
>  		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
>  			list_del_init(&p->shared_runq_node);
>  		raw_spin_unlock(&shard->lock);
> +		WRITE_ONCE(shard->overload, 0);
>  	}
>  }
>  
> @@ -258,13 +260,20 @@ shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>  {
>  	struct task_struct *p;
>  
> -	if (list_empty(&shard->list))
> +	if (!READ_ONCE(shard->overload))
>  		return NULL;
>  
> +	if (list_empty(&shard->list)) {
> +		WRITE_ONCE(shard->overload, 0);
> +		return NULL;
> +	}
> +
>  	raw_spin_lock(&shard->lock);
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +	if (!p)
> +		WRITE_ONCE(shard->overload, 0);
> +	else if (is_cpu_allowed(p, target))
>  		list_del_init(&p->shared_runq_node);
>  	else
>  		p = NULL;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f50176f720b1..e8d4d948f742 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -601,6 +601,20 @@ do {									\
>  struct shared_runq_shard {
>  	struct list_head list;
>  	raw_spinlock_t lock;
> +	/*
> +	 * shared_runq_shard can contain running tasks.
> +	 * In such cases where all the tasks are running,
> +	 * it is futile to attempt to pull tasks from the
> +	 * list. Overload flag is used to indicate case
> +	 * where one or more rq in the shard domain may
> +	 * have a queued task. If the flag is 0, it is
> +	 * very likely that all tasks in the shard are
> +	 * running and cannot be migrated. This is not
> +	 * guarded by the shard lock, and since it may
> +	 * be updated often, it is placed into its own
> +	 * cacheline.
> +	 */
> +	int overload ____cacheline_aligned;
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> @@ -2585,6 +2599,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  	if (prev_nr < 2 && rq->nr_running >= 2) {
>  		if (!READ_ONCE(rq->rd->overload))
>  			WRITE_ONCE(rq->rd->overload, 1);
> +
> +		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
> +			WRITE_ONCE(rq->cfs.shard->overload, 1);
>  	}
>  #endif
>  
> -- 
> 2.34.1
>
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by K Prateek Nayak 2 years, 2 months ago
Hello David,

Some more test results (although this might be slightly irrelevant with
next version around the corner)

On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> 
> Hi Prateek,
> 
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>      - 10.94% native_queued_spin_lock_slowpath
>>         - 10.73% _raw_spin_lock
>>            - 9.57% __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>            - 0.82% task_rq_lock
>>                 newidle_balance
>>                 pick_next_task_fair
>>                 __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>
>> Since David mentioned rq->avg_idle check is probably not the right step
>> towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>> notifies of the possibility of one or more rq covered in the shard's
>> domain having a queued task. shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when shard's list is found
>> to be empty.
> 
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.
> 
I have some more numbers. This time I'm accounting for the cost of peeking
into the shared-runq and comparing two variants - one that keeps the
current vanilla flow from your v3, and the other that moves the
rq->avg_idle check before looking at the shared-runq. Following are the results:

-> Without EEVDF

o tl;dr

- With the avg_idle check, the improvements observed with shared-runq
  aren't as large, but they are still noticeable.

- Most regressions seen with the introduction of shared-runq are gone,
  and the remaining ones aren't as bad.

o Kernels

base			: tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
shared_runq		: base + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
			  (the rd->overload check still remains below the shared_runq access)
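
For reference, here is a minimal sketch of what the shared_runq_idle_check
variant does in newidle_balance(). The placement is inferred from the
description above, and the SHARED_RUNQ feature bit and the
shared_runq_pick_next_task() helper are assumed from the v3 series; this is
a simplified sketch, not the exact diff that was tested:

	static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
	{
		int pulled_task = 0;

		/*
		 * Bail out based on avg_idle *before* touching the shared
		 * runqueue, so a CPU that is expected to be idle only
		 * briefly never contends on the shard lock.
		 */
		if (this_rq->avg_idle < sysctl_sched_migration_cost)
			goto out;

		if (sched_feat(SHARED_RUNQ)) {
			pulled_task = shared_runq_pick_next_task(this_rq, rf);
			if (pulled_task)
				return pulled_task;
		}

		/* The rd->overload check stays below the shared_runq access. */
		if (!READ_ONCE(this_rq->rd->overload))
			goto out;

		/* ... regular newidle load balancing follows ... */
	out:
		return pulled_task;
	}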

o Benchmarks

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           base[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.64)     0.90 [ 10.20]( 8.79)             0.93 [  7.08]( 3.87)
 2-groups     1.00 [ -0.00]( 2.97)     0.85 [ 15.06]( 4.86)             0.96 [  4.47]( 2.22)
 4-groups     1.00 [ -0.00]( 1.84)     0.93 [  7.38]( 2.63)             0.94 [  6.07]( 1.02)
 8-groups     1.00 [ -0.00]( 1.24)     0.97 [  2.83]( 2.69)             0.98 [  1.82]( 1.01)
16-groups     1.00 [ -0.00]( 3.31)     1.03 [ -2.93]( 2.46)             1.02 [ -1.61]( 1.34)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:   base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
    1     1.00 [  0.00]( 1.08)     0.98 [ -1.89]( 0.48)              0.99 [ -0.73]( 0.70)
    2     1.00 [  0.00]( 0.69)     0.99 [ -1.48]( 0.24)              0.98 [ -1.62]( 0.85)
    4     1.00 [  0.00]( 0.70)     0.97 [ -2.87]( 1.34)              0.98 [ -2.15]( 0.48)
    8     1.00 [  0.00]( 0.85)     0.97 [ -3.17]( 1.56)              0.99 [ -1.32]( 1.09)
   16     1.00 [  0.00]( 2.18)     0.91 [ -8.70]( 0.27)              0.98 [ -2.03]( 1.28)
   32     1.00 [  0.00]( 3.84)     0.51 [-48.53]( 2.52)              1.01 [  1.41]( 3.83)
   64     1.00 [  0.00]( 7.06)     0.38 [-62.49]( 1.89)              1.05 [  5.33]( 4.09)
  128     1.00 [  0.00]( 0.88)     0.41 [-58.92]( 0.28)              1.01 [  0.54]( 1.65)
  256     1.00 [  0.00]( 0.88)     0.97 [ -2.56]( 1.78)              1.00 [ -0.48]( 0.33)
  512     1.00 [  0.00]( 0.07)     1.00 [  0.06]( 0.04)              0.98 [ -1.51]( 0.44)
 1024     1.00 [  0.00]( 0.30)     0.99 [ -1.35]( 0.90)              1.00 [ -0.24]( 0.41)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 8.87)     1.00 [  0.31]( 5.27)                1.09 [  9.11]( 0.58)
Scale     1.00 [  0.00]( 6.80)     0.99 [ -0.81]( 7.20)                1.00 [  0.49]( 5.67)
  Add     1.00 [  0.00]( 7.24)     0.99 [ -1.13]( 7.02)                1.02 [  2.06]( 6.36)
Triad     1.00 [  0.00]( 5.00)     0.96 [ -4.11]( 9.37)                1.03 [  3.46]( 4.41)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 0.45)     1.00 [  0.32]( 1.88)              1.04 [  4.02]( 1.45)
Scale     1.00 [  0.00]( 4.40)     0.98 [ -1.76]( 6.46)              1.01 [  1.28]( 1.00)
  Add     1.00 [  0.00]( 4.97)     0.98 [ -1.85]( 6.01)              1.03 [  2.75]( 0.24)
Triad     1.00 [  0.00]( 0.24)     0.96 [ -3.82]( 6.41)              0.99 [ -1.10]( 4.47)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     0.98 [ -2.37]( 0.08)              0.99 [ -1.32]( 0.37)
 2-clients     1.00 [  0.00]( 0.75)     0.98 [ -2.04]( 0.33)              0.98 [ -1.57]( 0.50)
 4-clients     1.00 [  0.00]( 0.84)     0.97 [ -3.25]( 1.01)              0.99 [ -0.77]( 0.54)
 8-clients     1.00 [  0.00]( 0.78)     0.96 [ -4.18]( 0.68)              0.99 [ -0.77]( 0.63)
16-clients     1.00 [  0.00]( 2.56)     0.84 [-15.71]( 6.33)              1.00 [ -0.35]( 0.58)
32-clients     1.00 [  0.00]( 1.03)     0.35 [-64.92]( 8.90)              0.98 [ -1.92]( 1.67)
64-clients     1.00 [  0.00]( 2.69)     0.26 [-74.05]( 6.56)              0.98 [ -2.46]( 2.42)
128-clients    1.00 [  0.00]( 1.91)     0.25 [-74.81]( 3.67)              0.99 [ -1.50]( 2.15)
256-clients    1.00 [  0.00]( 2.21)     0.92 [ -7.73]( 2.29)              0.98 [ -1.51]( 1.85)
512-clients    1.00 [  0.00](45.18)     0.96 [ -4.06](52.89)              0.98 [ -2.49](49.22)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:        base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
  1             1.00 [ -0.00](12.03)     1.04 [ -4.35](34.64)              1.13 [-13.04]( 2.25)
  2             1.00 [ -0.00]( 9.36)     1.00 [ -0.00]( 4.56)              1.12 [-11.54](12.83)
  4             1.00 [ -0.00]( 1.95)     1.00 [ -0.00](13.36)              0.93 [  6.67]( 9.10)
  8             1.00 [ -0.00]( 9.01)     0.97 [  2.70]( 4.68)              1.03 [ -2.70](12.11)
 16             1.00 [ -0.00]( 3.08)     1.02 [ -2.00]( 3.01)              1.00 [ -0.00]( 7.33)
 32             1.00 [ -0.00]( 0.75)     1.03 [ -2.60]( 8.20)              1.09 [ -9.09]( 4.24)
 64             1.00 [ -0.00]( 2.15)     0.91 [  9.20]( 1.03)              1.01 [ -0.61]( 7.14)
128             1.00 [ -0.00]( 5.18)     1.05 [ -4.57]( 7.74)              1.01 [ -0.57]( 5.62)
256             1.00 [ -0.00]( 4.18)     1.06 [ -5.87](51.02)              1.10 [ -9.51](15.82)
512             1.00 [ -0.00]( 2.10)     1.03 [ -3.36]( 2.88)              1.06 [ -5.87]( 1.10)


==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                                base                  shared_runq             shared_runq_idle_check
Hmean     unixbench-dhry2reg-1            41407024.82 (   0.00%)    41211208.57 (  -0.47%)     41354094.87 (  -0.13%)
Hmean     unixbench-dhry2reg-512        6249629291.88 (   0.00%)  6245782129.00 (  -0.06%)   6236514875.97 (  -0.21%)
Amean     unixbench-syscall-1              2922580.63 (   0.00%)     2928021.57 *  -0.19%*      2895742.17 *   0.92%*
Amean     unixbench-syscall-512            7606400.73 (   0.00%)     8390396.33 * -10.31%*      8236409.00 *  -8.28%*
Hmean     unixbench-pipe-1                 2574942.54 (   0.00%)     2610825.75 *   1.39%*      2531492.38 *  -1.69%*
Hmean     unixbench-pipe-512             364489246.31 (   0.00%)   366388360.22 *   0.52%*    360160487.66 *  -1.19%*
Hmean     unixbench-spawn-1                   4428.94 (   0.00%)        4391.20 (  -0.85%)         4541.06 (   2.53%)
Hmean     unixbench-spawn-512                68883.47 (   0.00%)       69143.38 (   0.38%)        69776.01 *   1.30%*
Hmean     unixbench-execl-1                   3878.47 (   0.00%)        3861.63 (  -0.43%)         3873.96 (  -0.12%)
Hmean     unixbench-execl-512                11638.84 (   0.00%)       12758.38 *   9.62%*        14001.23 *  20.30%*


==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
tip                     : 1.00 (var: 1.41%)
shared_runq             : 0.99 (var: 0.84%)  (diff: -1.40%)
shared_runq_idle_check  : 1.00 (var: 0.79%)  (diff:  0.00%)


==================================================================
Test          : DeathStarBench
Units         : %diff, relative to base
Interpretation: Higher is better
Statistic     : AMean
==================================================================
pinning      scaling   base    shared_runq    shared_runq_idle_check
1CDD            1       0%       -0.39%              -0.09%
2CDD            2       0%       -0.31%              -1.73%
4CDD            4       0%        3.28%               0.60%
8CDD            8       0%        4.35%               2.98% 
 

-> With EEVDF

o tl;dr

- Same as what was observed without EEVDF, but shared_runq now shows
  serious regressions with several more tbench and netperf client
  counts.

o Kernels

eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
			  (the rd->overload check still remains below the shared_runq access)

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.89)     0.95 [  4.72]( 8.98)         0.99 [  0.83]( 3.77)
 2-groups     1.00 [ -0.00]( 2.04)     0.86 [ 13.87]( 2.59)         0.95 [  4.92]( 1.98)
 4-groups     1.00 [ -0.00]( 2.38)     0.96 [  4.50]( 3.44)         0.98 [  2.45]( 1.93)
 8-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.95]( 1.36)         0.96 [  3.97]( 0.89)
16-groups     1.00 [ -0.00]( 3.44)     1.00 [ -0.00]( 1.59)         0.96 [  3.91]( 3.36)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:  eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
    1     1.00 [  0.00]( 0.18)     1.00 [  0.15]( 0.59)         0.98 [ -1.76]( 0.74)
    2     1.00 [  0.00]( 0.63)     0.97 [ -3.44]( 0.91)         0.98 [ -1.93]( 1.27)
    4     1.00 [  0.00]( 0.86)     0.95 [ -4.86]( 0.85)         0.99 [ -1.15]( 0.77)
    8     1.00 [  0.00]( 0.22)     0.94 [ -6.44]( 1.31)         0.99 [ -1.00]( 0.97)
   16     1.00 [  0.00]( 1.99)     0.86 [-13.68]( 0.38)         1.00 [ -0.47]( 0.99)
   32     1.00 [  0.00]( 4.29)     0.48 [-52.21]( 0.53)         1.01 [  1.24]( 6.66)
   64     1.00 [  0.00]( 1.71)     0.35 [-64.68]( 0.44)         0.99 [ -0.66]( 0.70)
  128     1.00 [  0.00]( 0.65)     0.40 [-60.32]( 0.95)         0.98 [ -2.15]( 1.25)
  256     1.00 [  0.00]( 0.19)     0.72 [-28.28]( 1.88)         0.99 [ -1.39]( 2.50)
  512     1.00 [  0.00]( 0.20)     0.79 [-20.59]( 4.40)         1.00 [ -0.42]( 0.38)
 1024     1.00 [  0.00]( 0.29)     0.80 [-20.24]( 0.64)         0.99 [ -0.51]( 0.20)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)   shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 4.32)     0.94 [ -6.40]( 8.05)          1.01 [  1.39]( 4.58)
Scale     1.00 [  0.00]( 5.21)     0.98 [ -2.15]( 6.79)          0.95 [ -4.54]( 7.35)
  Add     1.00 [  0.00]( 6.25)     0.97 [ -2.64]( 6.47)          0.97 [ -3.08]( 7.49)
Triad     1.00 [  0.00](10.74)     1.01 [  0.92]( 7.40)          1.01 [  1.25]( 8.76)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 0.70)     1.00 [ -0.07]( 0.70)         1.00 [  0.47]( 0.94)
Scale     1.00 [  0.00]( 6.55)     1.02 [  1.72]( 4.83)         1.03 [  2.96]( 1.00)
  Add     1.00 [  0.00]( 6.53)     1.02 [  1.53]( 4.77)         1.03 [  3.19]( 0.90)
Triad     1.00 [  0.00]( 6.66)     1.00 [  0.06]( 6.29)         0.99 [ -0.70]( 5.79)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     1.02 [  1.73]( 0.31)           0.99 [ -0.81]( 0.24)
 2-clients     1.00 [  0.00]( 0.38)     0.99 [ -0.68]( 1.17)           0.99 [ -0.87]( 0.45)
 4-clients     1.00 [  0.00]( 0.72)     0.97 [ -3.38]( 1.38)           0.99 [ -1.26]( 0.47)
 8-clients     1.00 [  0.00]( 0.98)     0.94 [ -6.30]( 1.84)           1.00 [ -0.44]( 0.45)
16-clients     1.00 [  0.00]( 0.70)     0.56 [-44.08]( 5.11)           0.99 [ -0.83]( 0.49)
32-clients     1.00 [  0.00]( 0.74)     0.35 [-64.92]( 1.98)           0.98 [ -2.14]( 2.14)
64-clients     1.00 [  0.00]( 2.24)     0.26 [-73.79]( 5.72)           0.97 [ -2.57]( 2.44)
128-clients    1.00 [  0.00]( 1.72)     0.25 [-74.91]( 6.72)           0.96 [ -3.66]( 1.48)
256-clients    1.00 [  0.00]( 4.44)     0.68 [-31.60]( 5.42)           0.96 [ -3.61]( 3.64)
512-clients    1.00 [  0.00](52.42)     0.67 [-32.81](48.45)           0.96 [ -3.80](55.24)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:   eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
  1         1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 6.19)          0.84 [ 16.00](20.83)
  2         1.00 [ -0.00]( 6.42)     0.89 [ 10.71]( 2.34)          0.96 [  3.57]( 4.17)
  4         1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.35)          1.00 [ -0.00]( 9.12)
  8         1.00 [ -0.00](13.83)     1.03 [ -2.63]( 6.96)          0.95 [  5.26]( 6.93)
 16         1.00 [ -0.00]( 4.37)     1.02 [ -2.13]( 4.17)          1.02 [ -2.13]( 3.53)
 32         1.00 [ -0.00]( 8.69)     0.96 [  3.70]( 5.23)          0.98 [  2.47]( 4.43)
 64         1.00 [ -0.00]( 2.30)     0.96 [  3.85]( 2.34)          0.92 [  7.69]( 4.14)
128         1.00 [ -0.00](12.12)     0.97 [  3.12]( 3.31)          0.93 [  6.53]( 5.31)
256         1.00 [ -0.00](26.04)     1.87 [-86.57](33.02)          1.63 [-62.73](40.63)
512         1.00 [ -0.00]( 5.62)     1.04 [ -3.80]( 0.35)          1.09 [ -8.78]( 2.56)
 
==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================

                                             eevdf                   shared_runq             shared_runq_idle_check
Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41245183.04 (  -0.01%)    41297801.58 (   0.12%)
Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6236534715.56 (  -0.06%)  6237356670.12 (  -0.04%)
Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2893792.10 *   2.52%*     2799609.00 *   5.69%*
Amean     unixbench-syscall-512            7790656.20 (   0.00%)     8489302.67 *  -8.97%*     7685974.47 *   1.34%*
Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2554662.39 *   0.75%*     2521853.23 *  -0.55%*
Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   365752991.35 *   1.21%*   358310503.28 *  -0.85%*
Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4566.00 (   1.33%)        4242.52 *  -5.85%*
Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69554.52 (   0.25%)       69413.14 (   0.05%)
Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3782.82 *  -1.09%*        3832.10 (   0.20%)
Hmean     unixbench-execl-512                12288.64 (   0.00%)       13248.40 (   7.81%)       12661.78 (   3.04%)
 
==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
eevdf                   : 1.00 (var: 1.41%)
shared_runq             : 0.98 (var: 0.84%)  (diff: -2.40%)
shared_runq_idle_check  : 0.97 (var: 0.79%)  (diff: -3.06%)


==================================================================
Test          : DeathStarBench
Units         : %diff, relative to eevdf
Interpretation: Higher is better
Statistic     : AMean
==================================================================
pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
1CDD            1       0%       -0.85%             -1.56%
2CDD            2       0%       -0.60%             -1.22%
4CDD            4       0%        2.87%              0.02%
8CDD            8       0%        0.36%              1.57%


--
Thanks and Regards,
Prateek
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by David Vernet 2 years, 2 months ago
On Wed, Sep 27, 2023 at 09:53:13AM +0530, K Prateek Nayak wrote:
> Hello David,

Hi Prateek,

> Some more test results (although this might be slightly irrelevant with
> next version around the corner)

Excellent, thanks for running these tests. The results are certainly
encouraging, and it seems clear that we could really improve the feature
by adding some of the logic you've experimented with to v5 (whether it's
the rq->avg_idle check, the per-shard overload, etc). I know that I owe
at least you and Chenyu more substantive responses on this and a few
other emails that have been sent over the last week or two. I apologize
for the latency in my responses. I'm still at Kernel Recipes, but plan
to focus on this for the next couple of weeks after I'm back stateside.
I originally intended to revisit this _last_ week after my PTO, but got
caught up in helping with some sched_ext related stuff.

Just wanted to give you and everyone else a heads up that I haven't been
ignoring this intentionally, I've just been preempted a lot with travel
this month.

All of the work you folks are putting in is extremely helpful and
appreciated. I'm excited by the improvements we're making to
SHARED_RUNQ, and think that a lot of this can and should be folded into
v5.

So with that said, please feel free to keep experimenting and
discussing, and I'll rejoin the convo full time as soon as I can (which
should be either Friday or next week).

- David

> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> > Hi Prateek,
> > 
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>      - 10.94% native_queued_spin_lock_slowpath
> >>         - 10.73% _raw_spin_lock
> >>            - 9.57% __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>            - 0.82% task_rq_lock
> >>                 newidle_balance
> >>                 pick_next_task_fair
> >>                 __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>
> >> Since David mentioned rq->avg_idle check is probably not the right step
> >> towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >> notifies of the possibility of one or more rq covered in the shard's
> >> domain having a queued task. shard's overload flag is set at the same
> >> time as "rq->rd->overload", and is cleared when shard's list is found
> >> to be empty.
> > 
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
> > 
> I have some more numbers. This time I'm accounting the cost for peeking
> into the shared-runq and have two variants - one that keeps the current
> vanilla flow from your v3 and the other that moves the rq->avg_idle
> check before looking at the shared-runq. Following are the results:
> 
> -> Without EEVDF
> 
> o tl;dr
> 
> - With avg_idle check, the improvements observed with shared-runq
>   aren't as large but they are still noticeable.
> 
> - Most regressions are gone nad the others aren't as bad with the
>   introduction of shared-runq
> 
> o Kernels
> 
> base			: tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
> shared_runq		: base + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
> 
> o Benchmarks
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:           base[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 2.64)     0.90 [ 10.20]( 8.79)             0.93 [  7.08]( 3.87)
>  2-groups     1.00 [ -0.00]( 2.97)     0.85 [ 15.06]( 4.86)             0.96 [  4.47]( 2.22)
>  4-groups     1.00 [ -0.00]( 1.84)     0.93 [  7.38]( 2.63)             0.94 [  6.07]( 1.02)
>  8-groups     1.00 [ -0.00]( 1.24)     0.97 [  2.83]( 2.69)             0.98 [  1.82]( 1.01)
> 16-groups     1.00 [ -0.00]( 3.31)     1.03 [ -2.93]( 2.46)             1.02 [ -1.61]( 1.34)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:   base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>     1     1.00 [  0.00]( 1.08)     0.98 [ -1.89]( 0.48)              0.99 [ -0.73]( 0.70)
>     2     1.00 [  0.00]( 0.69)     0.99 [ -1.48]( 0.24)              0.98 [ -1.62]( 0.85)
>     4     1.00 [  0.00]( 0.70)     0.97 [ -2.87]( 1.34)              0.98 [ -2.15]( 0.48)
>     8     1.00 [  0.00]( 0.85)     0.97 [ -3.17]( 1.56)              0.99 [ -1.32]( 1.09)
>    16     1.00 [  0.00]( 2.18)     0.91 [ -8.70]( 0.27)              0.98 [ -2.03]( 1.28)
>    32     1.00 [  0.00]( 3.84)     0.51 [-48.53]( 2.52)              1.01 [  1.41]( 3.83)
>    64     1.00 [  0.00]( 7.06)     0.38 [-62.49]( 1.89)              1.05 [  5.33]( 4.09)
>   128     1.00 [  0.00]( 0.88)     0.41 [-58.92]( 0.28)              1.01 [  0.54]( 1.65)
>   256     1.00 [  0.00]( 0.88)     0.97 [ -2.56]( 1.78)              1.00 [ -0.48]( 0.33)
>   512     1.00 [  0.00]( 0.07)     1.00 [  0.06]( 0.04)              0.98 [ -1.51]( 0.44)
>  1024     1.00 [  0.00]( 0.30)     0.99 [ -1.35]( 0.90)              1.00 [ -0.24]( 0.41)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 8.87)     1.00 [  0.31]( 5.27)                1.09 [  9.11]( 0.58)
> Scale     1.00 [  0.00]( 6.80)     0.99 [ -0.81]( 7.20)                1.00 [  0.49]( 5.67)
>   Add     1.00 [  0.00]( 7.24)     0.99 [ -1.13]( 7.02)                1.02 [  2.06]( 6.36)
> Triad     1.00 [  0.00]( 5.00)     0.96 [ -4.11]( 9.37)                1.03 [  3.46]( 4.41)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.45)     1.00 [  0.32]( 1.88)              1.04 [  4.02]( 1.45)
> Scale     1.00 [  0.00]( 4.40)     0.98 [ -1.76]( 6.46)              1.01 [  1.28]( 1.00)
>   Add     1.00 [  0.00]( 4.97)     0.98 [ -1.85]( 6.01)              1.03 [  2.75]( 0.24)
> Triad     1.00 [  0.00]( 0.24)     0.96 [ -3.82]( 6.41)              0.99 [ -1.10]( 4.47)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Througput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:        base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     0.98 [ -2.37]( 0.08)              0.99 [ -1.32]( 0.37)
>  2-clients     1.00 [  0.00]( 0.75)     0.98 [ -2.04]( 0.33)              0.98 [ -1.57]( 0.50)
>  4-clients     1.00 [  0.00]( 0.84)     0.97 [ -3.25]( 1.01)              0.99 [ -0.77]( 0.54)
>  8-clients     1.00 [  0.00]( 0.78)     0.96 [ -4.18]( 0.68)              0.99 [ -0.77]( 0.63)
> 16-clients     1.00 [  0.00]( 2.56)     0.84 [-15.71]( 6.33)              1.00 [ -0.35]( 0.58)
> 32-clients     1.00 [  0.00]( 1.03)     0.35 [-64.92]( 8.90)              0.98 [ -1.92]( 1.67)
> 64-clients     1.00 [  0.00]( 2.69)     0.26 [-74.05]( 6.56)              0.98 [ -2.46]( 2.42)
> 128-clients    1.00 [  0.00]( 1.91)     0.25 [-74.81]( 3.67)              0.99 [ -1.50]( 2.15)
> 256-clients    1.00 [  0.00]( 2.21)     0.92 [ -7.73]( 2.29)              0.98 [ -1.51]( 1.85)
> 512-clients    1.00 [  0.00](45.18)     0.96 [ -4.06](52.89)              0.98 [ -2.49](49.22)
> 
> 
> ==================================================================
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:        base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>   1             1.00 [ -0.00](12.03)     1.04 [ -4.35](34.64)              1.13 [-13.04]( 2.25)
>   2             1.00 [ -0.00]( 9.36)     1.00 [ -0.00]( 4.56)              1.12 [-11.54](12.83)
>   4             1.00 [ -0.00]( 1.95)     1.00 [ -0.00](13.36)              0.93 [  6.67]( 9.10)
>   8             1.00 [ -0.00]( 9.01)     0.97 [  2.70]( 4.68)              1.03 [ -2.70](12.11)
>  16             1.00 [ -0.00]( 3.08)     1.02 [ -2.00]( 3.01)              1.00 [ -0.00]( 7.33)
>  32             1.00 [ -0.00]( 0.75)     1.03 [ -2.60]( 8.20)              1.09 [ -9.09]( 4.24)
>  64             1.00 [ -0.00]( 2.15)     0.91 [  9.20]( 1.03)              1.01 [ -0.61]( 7.14)
> 128             1.00 [ -0.00]( 5.18)     1.05 [ -4.57]( 7.74)              1.01 [ -0.57]( 5.62)
> 256             1.00 [ -0.00]( 4.18)     1.06 [ -5.87](51.02)              1.10 [ -9.51](15.82)
> 512             1.00 [ -0.00]( 2.10)     1.03 [ -3.36]( 2.88)              1.06 [ -5.87]( 1.10)
> 
> 
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
>                                                 base                  shared_runq             shared_runq_idle_check
> Hmean     unixbench-dhry2reg-1            41407024.82 (   0.00%)    41211208.57 (  -0.47%)     41354094.87 (  -0.13%)
> Hmean     unixbench-dhry2reg-512        6249629291.88 (   0.00%)  6245782129.00 (  -0.06%)   6236514875.97 (  -0.21%)
> Amean     unixbench-syscall-1              2922580.63 (   0.00%)     2928021.57 *  -0.19%*      2895742.17 *   0.92%*
> Amean     unixbench-syscall-512            7606400.73 (   0.00%)     8390396.33 * -10.31%*      8236409.00 *  -8.28%*
> Hmean     unixbench-pipe-1                 2574942.54 (   0.00%)     2610825.75 *   1.39%*      2531492.38 *  -1.69%*
> Hmean     unixbench-pipe-512             364489246.31 (   0.00%)   366388360.22 *   0.52%*    360160487.66 *  -1.19%*
> Hmean     unixbench-spawn-1                   4428.94 (   0.00%)        4391.20 (  -0.85%)         4541.06 (   2.53%)
> Hmean     unixbench-spawn-512                68883.47 (   0.00%)       69143.38 (   0.38%)        69776.01 *   1.30%*
> Hmean     unixbench-execl-1                   3878.47 (   0.00%)        3861.63 (  -0.43%)         3873.96 (  -0.12%)
> Hmean     unixbench-execl-512                11638.84 (   0.00%)       12758.38 *   9.62%*        14001.23 *  20.30%*
> 
> 
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> tip                     : 1.00 (var: 1.41%)
> shared_runq             : 0.99 (var: 0.84%)  (diff: -1.40%)
> shared_runq_idle_check  : 1.00 (var: 0.79%)  (diff:  0.00%)
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : %diff, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
> 1CDD            1       0%       -0.39%              -0.09%
> 2CDD            2       0%       -0.31%              -1.73%
> 4CDD            4       0%        3.28%               0.60%
> 8CDD            8       0%        4.35%               2.98% 
>  
> 
> -> With EEVDF
> 
> o tl;dr
> 
> - Same as what was observed without EEVDF  but shared_runq shows
>   serious regression with multiple more variants of tbench and
>   netperf now.
> 
> o Kernels
> 
> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:          eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 1.89)     0.95 [  4.72]( 8.98)         0.99 [  0.83]( 3.77)
>  2-groups     1.00 [ -0.00]( 2.04)     0.86 [ 13.87]( 2.59)         0.95 [  4.92]( 1.98)
>  4-groups     1.00 [ -0.00]( 2.38)     0.96 [  4.50]( 3.44)         0.98 [  2.45]( 1.93)
>  8-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.95]( 1.36)         0.96 [  3.97]( 0.89)
> 16-groups     1.00 [ -0.00]( 3.44)     1.00 [ -0.00]( 1.59)         0.96 [  3.91]( 3.36)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:  eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
>     1     1.00 [  0.00]( 0.18)     1.00 [  0.15]( 0.59)         0.98 [ -1.76]( 0.74)
>     2     1.00 [  0.00]( 0.63)     0.97 [ -3.44]( 0.91)         0.98 [ -1.93]( 1.27)
>     4     1.00 [  0.00]( 0.86)     0.95 [ -4.86]( 0.85)         0.99 [ -1.15]( 0.77)
>     8     1.00 [  0.00]( 0.22)     0.94 [ -6.44]( 1.31)         0.99 [ -1.00]( 0.97)
>    16     1.00 [  0.00]( 1.99)     0.86 [-13.68]( 0.38)         1.00 [ -0.47]( 0.99)
>    32     1.00 [  0.00]( 4.29)     0.48 [-52.21]( 0.53)         1.01 [  1.24]( 6.66)
>    64     1.00 [  0.00]( 1.71)     0.35 [-64.68]( 0.44)         0.99 [ -0.66]( 0.70)
>   128     1.00 [  0.00]( 0.65)     0.40 [-60.32]( 0.95)         0.98 [ -2.15]( 1.25)
>   256     1.00 [  0.00]( 0.19)     0.72 [-28.28]( 1.88)         0.99 [ -1.39]( 2.50)
>   512     1.00 [  0.00]( 0.20)     0.79 [-20.59]( 4.40)         1.00 [ -0.42]( 0.38)
>  1024     1.00 [  0.00]( 0.29)     0.80 [-20.24]( 0.64)         0.99 [ -0.51]( 0.20)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)   shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 4.32)     0.94 [ -6.40]( 8.05)          1.01 [  1.39]( 4.58)
> Scale     1.00 [  0.00]( 5.21)     0.98 [ -2.15]( 6.79)          0.95 [ -4.54]( 7.35)
>   Add     1.00 [  0.00]( 6.25)     0.97 [ -2.64]( 6.47)          0.97 [ -3.08]( 7.49)
> Triad     1.00 [  0.00](10.74)     1.01 [  0.92]( 7.40)          1.01 [  1.25]( 8.76)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.70)     1.00 [ -0.07]( 0.70)         1.00 [  0.47]( 0.94)
> Scale     1.00 [  0.00]( 6.55)     1.02 [  1.72]( 4.83)         1.03 [  2.96]( 1.00)
>   Add     1.00 [  0.00]( 6.53)     1.02 [  1.53]( 4.77)         1.03 [  3.19]( 0.90)
> Triad     1.00 [  0.00]( 6.66)     1.00 [  0.06]( 6.29)         0.99 [ -0.70]( 5.79)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Througput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:       eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     1.02 [  1.73]( 0.31)           0.99 [ -0.81]( 0.24)
>  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -0.68]( 1.17)           0.99 [ -0.87]( 0.45)
>  4-clients     1.00 [  0.00]( 0.72)     0.97 [ -3.38]( 1.38)           0.99 [ -1.26]( 0.47)
>  8-clients     1.00 [  0.00]( 0.98)     0.94 [ -6.30]( 1.84)           1.00 [ -0.44]( 0.45)
> 16-clients     1.00 [  0.00]( 0.70)     0.56 [-44.08]( 5.11)           0.99 [ -0.83]( 0.49)
> 32-clients     1.00 [  0.00]( 0.74)     0.35 [-64.92]( 1.98)           0.98 [ -2.14]( 2.14)
> 64-clients     1.00 [  0.00]( 2.24)     0.26 [-73.79]( 5.72)           0.97 [ -2.57]( 2.44)
> 128-clients    1.00 [  0.00]( 1.72)     0.25 [-74.91]( 6.72)           0.96 [ -3.66]( 1.48)
> 256-clients    1.00 [  0.00]( 4.44)     0.68 [-31.60]( 5.42)           0.96 [ -3.61]( 3.64)
> 512-clients    1.00 [  0.00](52.42)     0.67 [-32.81](48.45)           0.96 [ -3.80](55.24)
> 
> 
> ==================================================================
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:   eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>   1         1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 6.19)          0.84 [ 16.00](20.83)
>   2         1.00 [ -0.00]( 6.42)     0.89 [ 10.71]( 2.34)          0.96 [  3.57]( 4.17)
>   4         1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.35)          1.00 [ -0.00]( 9.12)
>   8         1.00 [ -0.00](13.83)     1.03 [ -2.63]( 6.96)          0.95 [  5.26]( 6.93)
>  16         1.00 [ -0.00]( 4.37)     1.02 [ -2.13]( 4.17)          1.02 [ -2.13]( 3.53)
>  32         1.00 [ -0.00]( 8.69)     0.96 [  3.70]( 5.23)          0.98 [  2.47]( 4.43)
>  64         1.00 [ -0.00]( 2.30)     0.96 [  3.85]( 2.34)          0.92 [  7.69]( 4.14)
> 128         1.00 [ -0.00](12.12)     0.97 [  3.12]( 3.31)          0.93 [  6.53]( 5.31)
> 256         1.00 [ -0.00](26.04)     1.87 [-86.57](33.02)          1.63 [-62.73](40.63)
> 512         1.00 [ -0.00]( 5.62)     1.04 [ -3.80]( 0.35)          1.09 [ -8.78]( 2.56)
>  
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
> 
>                                              eevdf                   shared_runq             shared_runq_idle_check
> Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41245183.04 (  -0.01%)    41297801.58 (   0.12%)
> Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6236534715.56 (  -0.06%)  6237356670.12 (  -0.04%)
> Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2893792.10 *   2.52%*     2799609.00 *   5.69%*
> Amean     unixbench-syscall-512            7790656.20 (   0.00%)     8489302.67 *  -8.97%*     7685974.47 *   1.34%*
> Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2554662.39 *   0.75%*     2521853.23 *  -0.55%*
> Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   365752991.35 *   1.21%*   358310503.28 *  -0.85%*
> Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4566.00 (   1.33%)        4242.52 *  -5.85%*
> Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69554.52 (   0.25%)       69413.14 (   0.05%)
> Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3782.82 *  -1.09%*        3832.10 (   0.20%)
> Hmean     unixbench-execl-512                12288.64 (   0.00%)       13248.40 (   7.81%)       12661.78 (   3.04%)
>  
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> eevdf                   : 1.00 (var: 1.41%)
> shared_runq             : 0.98 (var: 0.84%)  (diff: -2.40%)
> shared_runq_idle_check  : 0.97 (var: 0.79%)  (diff: -3.06%)
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : %diff, relative to eevdf
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
> 1CDD            1       0%       -0.85%             -1.56%
> 2CDD            2       0%       -0.60%             -1.22%
> 4CDD            4       0%        2.87%              0.02%
> 8CDD            8       0%        0.36%              1.57%
> 
> 
> --
> Thanks and Regards,
> Prateek
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by Chen Yu 2 years, 2 months ago
Hi Prateek,

On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> Hello David,
> 
> Some more test results (although this might be slightly irrelevant with
> next version around the corner)
> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> -> With EEVDF
> 
> o tl;dr
> 
> - Same as what was observed without EEVDF  but shared_runq shows
>   serious regression with multiple more variants of tbench and
>   netperf now.
> 
> o Kernels
> 
> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
>

I did not see any obvious regression on a Sapphire Rapids server, and it seems that
the result on your platform suggests that client/server (C/S) workloads could be
impacted by shared_runq. Meanwhile some individual workloads like HHVM in David's
environment (no shared resource between tasks if I understand correctly) could
benefit from shared_runq a lot. This makes me wonder if we can let shared_runq skip
the C/S tasks. The question would be how to define C/S tasks. At first thought: if
A only wakes up B, and B only wakes up A, then they could be regarded as a C/S pair
 (A->last_wakee == B && B->last_wakee == A &&
  A->wakee_flips <= 1 && B->wakee_flips <= 1)
But for netperf/tbench, this does not apply, because the netperf client leverages a
kernel thread (workqueue) to wake up the netserver, that is, A wakes up kthread T,
then T wakes up B. Unless we track the whole wakeup chain, we cannot detect this
behavior.
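
A minimal sketch of that heuristic as a hypothetical helper (is_cs_pair() does
not exist in the series or in mainline; it only encodes the pairing condition
above using the existing p->last_wakee / p->wakee_flips fields):

	/* Hypothetical helper, only illustrating the pairing condition above. */
	static inline bool is_cs_pair(struct task_struct *a, struct task_struct *b)
	{
		return a->last_wakee == b && b->last_wakee == a &&
		       a->wakee_flips <= 1 && b->wakee_flips <= 1;
	}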

thanks,
Chenyu
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by David Vernet 2 years, 2 months ago
On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> Hi Prateek,

Hi Chenyu,

> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > Hello David,
> > 
> > Some more test results (although this might be slightly irrelevant with
> > next version around the corner)
> > 
> > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > > 
> > -> With EEVDF
> > 
> > o tl;dr
> > 
> > - Same as what was observed without EEVDF  but shared_runq shows
> >   serious regression with multiple more variants of tbench and
> >   netperf now.
> > 
> > o Kernels
> > 
> > eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> > shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > 			  (the rd->overload check still remains below the shared_runq access)
> >
> 
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from

Correct, hhvmworkers are largely independent, though they do sometimes
synchronize, and they also sometimes rely on I/O happening in other
tasks.

> shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.

I'm also open to this possibility, but I worry that we'd be going down
the same rabbit hole as what fair.c does already, which is use
heuristics to determine when something should or shouldn't be migrated,
etc. I really do feel that there's value in SHARED_RUNQ providing
consistent and predictable work conservation behavior.

On the other hand, it's clear that there are things we can do to improve
performance for some of these client/server workloads that hammer the
runqueue on larger CCXs / sockets. If we can avoid those regressions
while still having reasonably high confidence that work conservation
won't disproportionately suffer, I'm open to us making some tradeoffs
and/or adding a bit of complexity to avoid some of this unnecessary
contention.

I think it's probably about time for v4 to be sent out. What do you
folks think about including:

1. A few various fixes / tweaks from v3, e.g. avoiding using the wrong
   shard on the task_dead_fair() path if the feature is disabled before
   a dying task is dequeued from a shard, fixing the build issues
   pointed out by lkp, etc.
2. Fix the issue that Prateek pointed out in [0] where we're not
   properly skipping the LLC domain due to using the for_each_domain()
   macro (this is also addressed by (3)).
3. Apply Prateek's suggestions (in some form) in [1] and [2]. For [2],
   I'm inclined to just avoid enqueuing a task on a shard if the rq it's on
   has nr_running == 0 (rough sketch below). Or, we can just add his patch
   to the series directly if it turns out that just looking at
   rq->nr_running is insufficient.

[0]: https://lore.kernel.org/all/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/
[1]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@amd.com/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@amd.com/
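
For (3), the rough sketch I have in mind is something like the below.
It's untested and only illustrative (the real shared_runq_enqueue_task()
in the series does not look like this); it just shows the shape of the
rq->nr_running check:

	/*
	 * Illustrative only: skip the shard if the task's rq was otherwise
	 * idle, since that CPU will run the task on its next resched pass
	 * anyway and pulling it from elsewhere would just be churn.
	 */
	static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
	{
		struct shared_runq_shard *shard = rq->cfs.shard;

		/*
		 * Depending on where in the enqueue path this runs, the
		 * threshold may need to be 1 rather than 0.
		 */
		if (!rq->nr_running)
			return;

		raw_spin_lock(&shard->lock);
		list_add_tail(&p->shared_runq_node, &shard->list);
		raw_spin_unlock(&shard->lock);
	}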

Prateek -- what do you think about this? I want to make sure you get
credit for your contributions to this series, so let me know how you'd
like to apply these changes. [1] essentially just improves much of the
logic from [3], so I'm not sure it would make sense to include it as a
separate patch. I'm happy to include a Co-authored-by tag, or to just
explicitly credit your contributions in the commit summary if you'd
prefer that.

[3]: https://lore.kernel.org/all/20230809221218.163894-7-void@manifault.com/

Thanks,
David
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by Chen Yu 2 years, 2 months ago
Hi David,

On 2023-10-03 at 16:05:11 -0500, David Vernet wrote:
> On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> > Hi Prateek,
> 
> Hi Chenyu,
> 
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > > Hello David,
> > > 
> > > Some more test results (although this might be slightly irrelevant with
> > > next version around the corner)
> > > 
> > > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > > > 
> > > -> With EEVDF
> > > 
> > > o tl;dr
> > > 
> > > - Same as what was observed without EEVDF  but shared_runq shows
> > >   serious regression with multiple more variants of tbench and
> > >   netperf now.
> > > 
> > > o Kernels
> > > 
> > > eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > > shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> > > shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > > 			  (the rd->overload check still remains below the shared_runq access)
> > >
> > 
> > I did not see any obvious regression on a Sapphire Rapids server and it seems that
> > the result on your platform suggests that C/S workload could be impacted
> > by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> > (no shared resource between tasks if I understand correctly) could benefit from
> 
> Correct, hhvmworkers are largely independent, though they do sometimes
> synchronize, and they also sometimes rely on I/O happening in other
> tasks.
> 
> > shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.
> 
> I'm also open to this possibility, but I worry that we'd be going down
> the same rabbit hole as what fair.c does already, which is use
> heuristics to determine when something should or shouldn't be migrated,
> etc. I really do feel that there's value in SHARED_RUNQ providing
> consistent and predictable work conservation behavior.
> 
> On the other hand, it's clear that there are things we can do to improve
> performance for some of these client/server workloads that hammer the
> runqueue on larger CCXs / sockets. If we can avoid those regressions
> while still having reasonably high confidence that work conservation
> won't disproportionately suffer, I'm open to us making some tradeoffs
> and/or adding a bit of complexity to avoid some of this unnecessary
> contention.
> 

Since I did not observe any regression(although I did not test hackbench
yet) on the latest version you sent to me, I'm OK with postponing the
client/server optimization to make the patchset simple, and Prateek
has other proposal to deal with the regression.

> I think it's probably about time for v4 to be sent out. What do you
> folks think about including:
>

It's OK for me and I can launch the test once the latest version is released.

thanks,
Chenyu
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by K Prateek Nayak 2 years, 2 months ago
Hello Chenyu,

On 9/27/2023 12:29 PM, Chen Yu wrote:
> Hi Prateek,
> 
> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> Some more test results (although this might be slightly irrelevant with
>> next version around the corner)
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>> -> With EEVDF
>>
>> o tl;dr
>>
>> - Same as what was observed without EEVDF  but shared_runq shows
>>   serious regression with multiple more variants of tbench and
>>   netperf now.
>>
>> o Kernels
>>
>> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
>> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
>> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
>> 			  (the rd->overload check still remains below the shared_runq access)
>>
> 
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from
> shared_runq a lot.

Yup that would be my guess too since HHVM seems to benefit purely from
more aggressive work conservation. (unless it leads to some second order
effect)

> This makes me wonder if we can let shared_runq skip the C/S tasks.
> The question would be how to define C/S tasks. At first thought:
> A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> of C/S
>  (A->last_wakee == B && B->last_wakee == A &&
>   A->wakee_flips <= 1 && B->wakee_flips <= 1)
> But for netperf/tbench, this does not apply, because netperf client leverages kernel
> thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> wakes up B. Unless we have a chain, we can not detect this wakeup behavior.

Yup, unless we have a notion of chain/flow, or until we can somehow
account the wakeup of client using the kthread to the server, this will
be hard to detect.

I can give it a try with the SIS_PAIR condition you shared above. Let
me know.

> 
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by Chen Yu 2 years, 2 months ago
On 2023-09-27 at 14:06:41 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/27/2023 12:29 PM, Chen Yu wrote:
> > Hi Prateek,
> > 
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> Some more test results (although this might be slightly irrelevant with
> >> next version around the corner)
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>

[snip]

> > This makes me wonder if we can let shared_runq skip the C/S tasks.
> > The question would be how to define C/S tasks. At first thought:
> > A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> > of C/S
> >  (A->last_wakee == B && B->last_wakee == A &&
> >   A->wakee_flips <= 1 && B->wakee_flips <= 1)
> > But for netperf/tbench, this does not apply, because netperf client leverages kernel
> > thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> > wakes up B. Unless we have a chain, we can not detect this wakeup behavior.
> 
> Yup, unless we have a notion of chain/flow, or until we can somehow
> account the wakeup of client using the kthread to the server, this will
> be hard to detect.
> 
> I can give it a try with the SIS_PAIR condition you shared above. Let
> me know.

Thanks Prateek, but I don't think SIS_PAIR could bring any benefit to netperf/tbench
since SIS_PAIR cannot detect the chain wakeup.

thanks,
Chenyu
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by K Prateek Nayak 2 years, 3 months ago
Hello David,

On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> 
> Hi Prateek,
> 
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>      - 10.94% native_queued_spin_lock_slowpath
>>         - 10.73% _raw_spin_lock
>>            - 9.57% __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>            - 0.82% task_rq_lock
>>                 newidle_balance
>>                 pick_next_task_fair
>>                 __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>
>> Since David mentioned rq->avg_idle check is probably not the right step
>> towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>> notifies of the possibility of one or more rq covered in the shard's
>> domain having a queued task. shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when shard's list is found
>> to be empty.
> 
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.

I don't think so! Work conservation is possible if there is an
imbalance. Consider the case where we have 15 tasks in the shared_runq but we
have 16 CPUs, 15 of which are running these 15 tasks, and one going
idle. Work is conserved. What we need to worry about is tasks being in
the shared_runq that are queued on their respective CPU. This can only
happen if any one of the rq has nr_running >= 2, which is also the point
we are setting "shard->overload".
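
Illustratively (this snippet is only to show the shape, not the actual
hunk from the patch, and it assumes the hook is the existing
rq->rd->overload update in add_nr_running(), since that is where
nr_running crossing 2 is detected):

	if (prev_nr < 2 && rq->nr_running >= 2) {
		if (!READ_ONCE(rq->rd->overload))
			WRITE_ONCE(rq->rd->overload, 1);

		/* One rq covered by this shard now has a queued task. */
		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
			WRITE_ONCE(rq->cfs.shard->overload, 1);
	}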

Now situation can change later and all tasks in the shared_runq might be
running on respective CPUs but "shard->overload" is only cleared when
the shared_runq becomes empty. If this is too late, maybe we can clear
it if periodic load balancing finds no queuing (somewhere around the
time we update nr_idle_scan).

So the window where we do not go ahead with popping a task from the
shared_runq_shard->list is between the list being empty and at least one
of the CPU associated with the shard reporting nr_running >= 2, which is
when work conservation is needed.

> 
>> With these changes, following are the results for tbench 128-clients:
> 
> Just to make sure I understand, this is to address the contention we're
> observing on tbench with 64 - 256 clients, right?  That's my
> understanding from Gautham's reply in [0].
> 
> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/

I'm not sure if Gautham saw the contention with IBS, but he did see an
abnormal blowup in newidle_balance() counts, which he suspected was the
cause of the regression. I noticed the rq lock contention after doing a
fair bit of surgery. Let me go check if that was the case with vanilla
v3 too.

> 
> If so, are we sure this change won't regress other workloads that would
> have benefited from the work conservation?

I don't think we'll regress any workloads, as I explained above, because
the "overload" flag being 0 almost always (the update/access is not
atomic) indicates a case where the tasks cannot be pulled. However, that
needs to be tested since there is a small behavior change in
shared_runq_pick_next_task(): previously, if the task was running on a
CPU, we would have popped it from the shared_runq, done some lock
fiddling before finding out it is running, and some more lock fiddling
before the function returned "-1"; now, with the changes here, it'll
simply return "0". Although that is correct, we have seen some
interesting cases in the past [1] where random lock contention actually
helps certain benchmarks ¯\_(ツ)_/¯

[1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 

> 
> Also, I assume that you don't see the improved contention without this,
> even if you include your fix to the newidle_balance() that has us skip
> over the <= LLC domain?

No improvements! The lock is still very much contended for. I wonder if
it could be because of unlocking and locking the rq again in
shared_runq_pick_next_task() even when the task is on a CPU. Also, since
it returns -1 for this case, pick_next_task_fair() will return RETRY_TASK,
which can have further implications.

> 
> Thanks,
> David
> 
> P.S. Taking off on vacation now, so any replies will be very delayed.
> Thanks again for working on this!

Hope you have a great vacation :)

> 
>>
>> tip				: 1.00 (var: 1.00%)
>> tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
>> tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> [..snip..]
>>

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by David Vernet 2 years, 2 months ago
On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> Hello David,
> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> > Hi Prateek,
> > 
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>      - 10.94% native_queued_spin_lock_slowpath
> >>         - 10.73% _raw_spin_lock
> >>            - 9.57% __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>            - 0.82% task_rq_lock
> >>                 newidle_balance
> >>                 pick_next_task_fair
> >>                 __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>
> >> Since David mentioned rq->avg_idle check is probably not the right step
> >> towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >> notifies of the possibility of one or more rq covered in the shard's
> >> domain having a queued task. shard's overload flag is set at the same
> >> time as "rq->rd->overload", and is cleared when shard's list is found
> >> to be empty.
> > 
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
> 
> I don't think so! Work conservation is possible if there is an
> imbalance. Consider the case where we 15 tasks in the shared_runq but we
> have 16 CPUs, 15 of which are running these 15 tasks, and one going

I'm not sure I'm fully following. Those 15 tasks would not be enqueued
in the shared runq if they were being run. They would be dequeued from
the shared_runq in __dequeue_entity(), which would be called from
set_next_entity() before they were run. In this case, the
shard->overload check should be equivalent to the
!list_empty(&shard->list) check.

Oh, or is the idea that we're not bothering to pull them from the
shared_runq if they're being woken up and enqueued on an idle core that
will immediately run them on the next resched path? If so, I wonder if
we would instead just want to not enqueue the task in the shared_runq at
all? Consider that if another task comes in on an rq with
rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
were being woken up on idle cores (nor take the overhead of inserting
and then immediately removing them from the shared_runq).

> idle. Work is conserved. What we need to worry about is tasks being in
> the shared_runq that are queued on their respective CPU. This can only
> happen if any one of the rq has nr_running >= 2, which is also the point
> we are setting "shard->overload".

Assuming this is about the "wakeup / enqueue to idle core" case, ok,
this makes sense. I still think it probably makes more sense to just not
enqueue in the shared_runq for this case though, which would allow us to
instead just rely on list_empty(&shard->list).

> Now situation can change later and all tasks in the shared_runq might be
> running on respective CPUs but "shard->overload" is only cleared when
> the shared_runq becomes empty. If this is too late, maybe we can clear
> it if periodic load balancing finds no queuing (somewhere around the
> time we update nr_idle_scan).
> 
> So the window where we do not go ahead with popping a task from the
> shared_runq_shard->list is between the list being empty and at least one
> of the CPU associated with the shard reporting nr_running >= 2, which is
> when work conservation is needed.

So, I misread your patch the first time I reviewed it, and for some
reason thought you were only setting shard->overload on the
load_balance(). That's obviously not the case, and I now understand it
better, modulo my points above being clarified.

> > 
> >> With these changes, following are the results for tbench 128-clients:
> > 
> > Just to make sure I understand, this is to address the contention we're
> > observing on tbench with 64 - 256 clients, right?  That's my
> > understanding from Gautham's reply in [0].
> > 
> > [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/
> 
> I'm not sure if Gautham saw a contention with IBS but he did see an
> abnormal blowup in newidle_balance() counts which he suspected were the
> cause for the regression. I noticed the rq lock contention after doing a
> fair bit of surgery. Let me go check if that was the case with vanilla
> v3 too.
> 
> > 
> > If so, are we sure this change won't regress other workloads that would
> > have benefited from the work conservation?
> 
> I don't think we'll regress any workloads as I explained above because
> the "overload" flag being 0 almost (since update/access is not atomic)
> always indicate a case where the tasks cannot be pulled. However, that
> needs to be tested since there is a small behavior change in
> shared_runq_pick_next_task(). Where previously if the task is running
> on CPU, we would have popped it from shared_runq, did some lock
> fiddling before finding out it is running, some more lock fiddling
> before the function returned "-1", now with the changes here, it'll
> simply return a "0" and although that is correct, we have seen some
> interesting cases in past [1] where a random lock contention actually
> helps certain benchmarks ¯\_(ツ)_/¯

I don't think we need to worry about less lock contention possibly
hurting benchmarks :-)

> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 
> 
> > 
> > Also, I assume that you don't see the improved contention without this,
> > even if you include your fix to the newidle_balance() that has us skip
> > over the <= LLC domain?
> 
> No improvements! The lock is still very much contended for. I wonder if
> it could be because of the unlocking and locking the rq again in
> shared_runq_pick_next_task() even when task is on CPU. Also since it
> return -1 for this case, pick_next_task_fair() will return RETRY_TASK
> which can have further implications.

Yeah, I could see it being an issue if we're essentially thrashing tasks
in the shared_runq that are just temporarily enqueued in the shared_runq
between activate and doing a resched pass on an idle core.

Unfortunately, I don't think we have any choice but to drop and
reacquire the rq lock. It's not safe to look at task_cpu(p) without
pi_lock, and we can't safely acquire that without dropping the rq lock.
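
To spell the constraint out (purely illustrative, untested): the locking
order is p->pi_lock first and then the rq lock, so from newidle_balance(),
where this rq's lock is already held, the only safe option is roughly:

	/*
	 * Drop this rq's lock first; taking p->pi_lock while still holding
	 * it would invert the pi_lock -> rq lock ordering.
	 */
	rq_unpin_lock(this_rq, rf);
	raw_spin_rq_unlock(this_rq);

	/*
	 * task_rq_lock() takes p->pi_lock and then p's rq lock, after which
	 * task_cpu(p) and p's on_rq state are stable.
	 */
	src_rq = task_rq_lock(p, &src_rf);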

Thanks,
David
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by K Prateek Nayak 2 years, 2 months ago
Hello David,

Thank you for answering my queries, I'll leave some data below to
answer yours.

On 9/29/2023 10:31 PM, David Vernet wrote:
> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>>> Hi Prateek,
>>>
>>>> Even with the two patches, I still observe the following lock
>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>
>>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>         - 10.73% _raw_spin_lock
>>>>            - 9.57% __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>            - 0.82% task_rq_lock
>>>>                 newidle_balance
>>>>                 pick_next_task_fair
>>>>                 __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>
>>>> Since David mentioned rq->avg_idle check is probably not the right step
>>>> towards the solution, this experiment introduces a per-shard
>>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>>>> notifies of the possibility of one or more rq covered in the shard's
>>>> domain having a queued task. shard's overload flag is set at the same
>>>> time as "rq->rd->overload", and is cleared when shard's list is found
>>>> to be empty.
>>>
>>> I think this is an interesting idea, but I feel that it's still working
>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>> conservation.
>>
>> I don't think so! Work conservation is possible if there is an
>> imbalance. Consider the case where we 15 tasks in the shared_runq but we
>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
> 
> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> in the shared runq if they were being run. They would be dequeued from
> the shared_runq in __dequeue_entity(), which would be called from
> set_next_entity() before they were run. In this case, the
> shard->overload check should be equivalent to the
> !list_empty(&shard->list) check.
> 
> Oh, or is the idea that we're not bothering to pull them from the
> shared_runq if they're being woken up and enqueued on an idle core that
> will immediately run them on the next resched path? If so, I wonder if
> we would instead just want to not enqueue the task in the shared_runq at
> all? Consider that if another task comes in on an rq with
> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> were being woken up on idle cores (nor take the overhead of inserting
> and then immediately removing them from the shared_runq).

So this is the breakdown of outcomes after peeking into the shared_runq
during newidle_balance:

                                                SHARED_RUNQ                     SHARED_RUNQ
                                        + correct cost accounting       + correct cost accounting
                                                                        + rq->avg_idle early bail

tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)

attempts                                :       6,560,413                  2,273,334           (-65.35%)
shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
successful at pulling task              :       2,557,158 [38.98%]           342,839 [15.08%]  (-86.59%)
unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)

As you can see, there are more attempts and a greater chance of success
in the case without the rq->avg_idle check upfront. Where I believe the
problem lies is that a task is waiting to be enqueued / has been
enqueued while we are trying to migrate a task fetched from the
shared_runq. Thus, instead of just being idle for a short duration and
then running the task, we are now making it wait till we fetch another
task onto the CPU.

I think the scenario changes as follows with shared_runq:

- Current


      [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

- With shared_runq

      [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

We reach a similar final state but with shared_runq we've paid a price
for task migration. Worst case, the following timeline can happen:

        |
  CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
        |
        |                  pull T1 \             pull T4 /
        |
  CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
        |            [T4 TTWU]
        |

With the rq->avg_idle bailout, it might end up looking like:

        |
  CPU0  | [          T0 R, T1 Q          ] [T1 R ...
        |
        |
  CPU1  | [T3 R] [ I ] [T4 R ...
        |            
        |

If possible, can you check how long avg_idle is when running your
workload? Meanwhile, I believe there are a few workloads that exhibit
the same behavior as tbench (large-scale idling for short durations).
Let me go check if I can see a tbench-like issue there.

> 
>> idle. Work is conserved. What we need to worry about is tasks being in
>> the shared_runq that are queued on their respective CPU. This can only
>> happen if any one of the rq has nr_running >= 2, which is also the point
>> we are setting "shard->overload".
> 
> Assuming this is about the "wakeup / enqueue to idle core" case, ok,
> this makes sense. I still think it probably makes more sense to just not
> enqueue in the shared_runq for this case though, which would allow us to
> instead just rely on list_empty(&shard->list).
> 
>> Now situation can change later and all tasks in the shared_runq might be
>> running on respective CPUs but "shard->overload" is only cleared when
>> the shared_runq becomes empty. If this is too late, maybe we can clear
>> it if periodic load balancing finds no queuing (somewhere around the
>> time we update nr_idle_scan).
>>
>> So the window where we do not go ahead with popping a task from the
>> shared_runq_shard->list is between the list being empty and at least one
>> of the CPU associated with the shard reporting nr_running >= 2, which is
>> when work conservation is needed.
> 
> So, I misread your patch the first time I reviewed it, and for some
> reason thought you were only setting shard->overload on the
> load_balance(). That's obviously not the case, and I now understand it
> better, modulo my points above being clarified.
> 
>>>
>>>> With these changes, following are the results for tbench 128-clients:
>>>
>>> Just to make sure I understand, this is to address the contention we're
>>> observing on tbench with 64 - 256 clients, right?  That's my
>>> understanding from Gautham's reply in [0].
>>>
>>> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/
>>
>> I'm not sure if Gautham saw a contention with IBS but he did see an
>> abnormal blowup in newidle_balance() counts which he suspected were the
>> cause for the regression. I noticed the rq lock contention after doing a
>> fair bit of surgery. Let me go check if that was the case with vanilla
>> v3 too.
>>
>>>
>>> If so, are we sure this change won't regress other workloads that would
>>> have benefited from the work conservation?
>>
>> I don't think we'll regress any workloads as I explained above because
>> the "overload" flag being 0 almost (since update/access is not atomic)
>> always indicate a case where the tasks cannot be pulled. However, that
>> needs to be tested since there is a small behavior change in
>> shared_runq_pick_next_task(). Where previously if the task is running
>> on CPU, we would have popped it from shared_runq, did some lock
>> fiddling before finding out it is running, some more lock fiddling
>> before the function returned "-1", now with the changes here, it'll
>> simply return a "0" and although that is correct, we have seen some
>> interesting cases in past [1] where a random lock contention actually
>> helps certain benchmarks ¯\_(ツ)_/¯
> 
> I don't think we need to worry about less lock contention possibly
> hurting benchmarks :-)

Yup :)

> 
>> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 
>>
>>>
>>> Also, I assume that you don't see the improved contention without this,
>>> even if you include your fix to the newidle_balance() that has us skip
>>> over the <= LLC domain?
>>
>> No improvements! The lock is still very much contended for. I wonder if
>> it could be because of the unlocking and locking the rq again in
>> shared_runq_pick_next_task() even when task is on CPU. Also since it
>> return -1 for this case, pick_next_task_fair() will return RETRY_TASK
>> which can have further implications.
> 
> Yeah, I could see it being an issue if we're essentially thrashing tasks
> in the shared_runq that are just temporarily enqueued in the shared_runq
> between activate and doing a resched pass on an idle core.
> 
> Unfortunately, I don't think we have any choice but to drop and
> reacquire the rq lock. It's not safe to look at task_cpu(p) without
> pi_lock, and we can't safely acquire that without dropping the rq lock.

True that. We wouldn't want to run into a deadlock scenario or cause
more lock contention with double locking :(

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by David Vernet 2 years, 2 months ago
On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

> 
> Thank you for answering my queries, I'll leave some data below to
> answer yours.
> 
> On 9/29/2023 10:31 PM, David Vernet wrote:
> > On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>
> >>> Hi Prateek,
> >>>
> >>>> Even with the two patches, I still observe the following lock
> >>>> contention when profiling the tbench 128-clients run with IBS:
> >>>>
> >>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>>>      - 10.94% native_queued_spin_lock_slowpath
> >>>>         - 10.73% _raw_spin_lock
> >>>>            - 9.57% __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>            - 0.82% task_rq_lock
> >>>>                 newidle_balance
> >>>>                 pick_next_task_fair
> >>>>                 __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>
> >>>> Since David mentioned rq->avg_idle check is probably not the right step
> >>>> towards the solution, this experiment introduces a per-shard
> >>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >>>> notifies of the possibility of one or more rq covered in the shard's
> >>>> domain having a queued task. shard's overload flag is set at the same
> >>>> time as "rq->rd->overload", and is cleared when shard's list is found
> >>>> to be empty.
> >>>
> >>> I think this is an interesting idea, but I feel that it's still working
> >>> against the core proposition of SHARED_RUNQ, which is to enable work
> >>> conservation.
> >>
> >> I don't think so! Work conservation is possible if there is an
> >> imbalance. Consider the case where we 15 tasks in the shared_runq but we
> >> have 16 CPUs, 15 of which are running these 15 tasks, and one going
> > 
> > I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> > in the shared runq if they were being run. They would be dequeued from
> > the shared_runq in __dequeue_entity(), which would be called from
> > set_next_entity() before they were run. In this case, the
> > shard->overload check should be equivalent to the
> > !list_empty(&shard->list) check.
> > 
> > Oh, or is the idea that we're not bothering to pull them from the
> > shared_runq if they're being woken up and enqueued on an idle core that
> > will immediately run them on the next resched path? If so, I wonder if
> > we would instead just want to not enqueue the task in the shared_runq at
> > all? Consider that if another task comes in on an rq with
> > rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> > were being woken up on idle cores (nor take the overhead of inserting
> > and then immediately removing them from the shared_runq).

Friendly ping on this point. This is the only scenario where I could see
the overload check helping, so I want to make sure I'm understanding it
and am correct in that just avoiding enqueueing the task in the shard in
this scenario would give us the same benefit.

> So this is the breakdown of outcomes after peeking into the shared_runq
> during newidle_balance:
> 
>                                                 SHARED_RUNQ                     SHARED_RUNQ
>                                         + correct cost accounting       + correct cost accounting
>                                                                         + rq->avg_idle early bail
> 
> tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)
> 
> attempts                                :       6,560,413                  2,273,334           (-65.35%)
> shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
> successful at pulling task              :       2,557,158 [38/98%]           342,839 [15.08%]  (-86.59%)
> unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)
> 
> As you can see, there are more attempts and a greater chance of success
> in the case without the rq->avg_idle check upfront. Where the problem
> lies (at least what I believe is) a task is waiting to be enqueued / has
> been enqueued while we are trying to migrate a task fetched from the
> shared_runq. Thus, instead of just being idle for a short duration and
> running the task, we are now making it wait till we fetch another task
> onto the CPU.
>
> I think the scenario changes as follows with shared_runq:
> 
> - Current
> 
> 
>       [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
> 	+-------+	+-------+                       +-------+	+-------+
> 	|	|	|	|        wakeup         |	|	|	|
> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
> 	|	|	|	|       -------->       |	|	|	|
> 	+-------+	+-------+                       +-------+	+-------+
> 
> - With shared_runq
> 
>       [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
> 	+-------+	+-------+                       +-------+	+-------+
> 	|	|	|	|        wakeup         |	|	|	|
> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
> 	|	|	|	|       -------->       |	|	|	|
> 	+-------+	+-------+                       +-------+	+-------+
> 
> We reach a similar final state but with shared_runq we've paid a price
> for task migration. Worst case, the following timeline can happen:
> 
>         |
>   CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
>         |
>         |                  pull T1 \             pull T4 /
>         |
>   CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
>         |            [T4 TTWU]
>         |
> 
> With the rq->avg_idle bailout, it might end up looking like:
> 
>         |
>   CPU0  | [          T0 R, T1 Q          ] [T1 R ...
>         |
>         |
>   CPU1  | [T3 R] [ I ] [T4 R ...
>         |            
>         |

This certainly seems possible, and wouldn't be terribly surprising or
unexpected. Taking a step back here, I want to be clear that I do
understand the motivation for including the rq->avg_idle check for
SHARED_RUNQ; even just conceptually, and regardless of the numbers you
and others have observed for workloads that do these short sleeps. The
whole idea behind that check is that we want to avoid doing
newidle_balance() if the overhead of doing newidle_balance() would
exceed the amount of time that a task was blocked. Makes sense. Why
would you take the overhead of balancing if you have reason to believe
that a task is likely to be idle for less time than it takes to do a
migration?

There's certainly a reasonable argument for why that should also apply
to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
greater than the amount of time that an sd is expected to be idle, then
it's not worth bothering with SHARED_RUNQ either. On the other hand, the
claim of SHARED_RUNQ is that it's faster than doing a regular balance
pass, because we're doing an O(# shards) iteration to find tasks (before
sharding it was O(1)), rather than O(# CPUs). So if we also do the
rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
cache for a full load_balance() call.

Maybe that makes sense and is ultimately the correct design /
implementation for the feature. I'm not fundamentally opposed to that,
but I think we should be cognizant of the tradeoff we're making. If we
don't include this rq->avg_idle check, then some workloads will regress
because we're doing excessive migrations, but if we do check it, then
others will also regress because we're doing insufficient migrations due
to incorrectly assuming that an rq won't be idle for long. On yet
another hand, maybe it's fine to allow users to work around that by
setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
because we ignore that and set rq->max_idle_balance_cost = curr_cost in
newidle_balance() if we end up doing a balance pass. I also know that
Peter and others discourage the use of these debugfs knobs, so I'm not
sure it's even applicable to point that out as a workaround.

And so hopefully the problem starts to become clear. It doesn't take
long for us to get mired in heuristics that make it difficult to
reason about the expected behavior of the feature, and also difficult to
reason about future changes as these heuristics have now all crossed
streams. Maybe that's OK, and is preferable to the alternative. My
personal opinion, however, is that it's preferable to provide users with
knobs that do straightforward things that are independent from existing
heuristics and knobs which were added for other circumstances. I'd
rather have confidence that I understand how a feature is supposed to
work, and can easily reason about when it's stupid (or not) to use it,
vs. have an expectation for it to not regress workloads in any scenario.

Note that this doesn't mean we can't make my patches less dumb. I think
your suggestions to e.g. check the overload flag (or possibly even
better to just not enqueue in a shard if the rq isn't overloaded),
re-check ttwu->pending after failing to find a task in the shard, etc
make complete sense. There's no downside -- we're just avoiding
pointless work. It's the heuristics like checking rq->avg_idle that
really worry me.
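
(For the ttwu->pending re-check, I'm thinking of something as simple as
the following -- untested, and the exact placement within
newidle_balance() is hypothetical:)

	/*
	 * Illustrative: after failing to pull anything from the shard, bail
	 * out of the balance pass if a wakeup is already pending for this
	 * CPU; it will have something to run momentarily anyway.
	 */
	if (READ_ONCE(this_rq->ttwu_pending))
		return 0;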

Peter -- I think it would be helpful if you could weigh in here just to
provide your thoughts on this more "philosophical" question.

> If possible, can you check how long is the avg_idle running your
> workload? Meanwhile, I believe there are a few workloads that
> exhibit same behavior as tbench (large scale idling for short
> duration) Let me go check if I can see tbench like issue there.

Sure thing, in the meantime I'll test this out on HHVM. I've actually
been working on getting a build + testbed ready for a few days, so
hopefully it won't take much longer to get some results. Even if it
turns out that this works great for HHVM, I'd ideally like to get
Peter's and others' thoughts on the above.

Thanks,
David
Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Posted by K Prateek Nayak 2 years, 2 months ago
Hello David,

On 10/4/2023 10:50 PM, David Vernet wrote:
> On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
>> Hello David,
> 
> Hello Prateek,
> 
>>
>> Thank you for answering my queries, I'll leave some data below to
>> answer yours.
>>
>> On 9/29/2023 10:31 PM, David Vernet wrote:
>>> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>>>> Hello David,
>>>>
>>>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>>>
>>>>> Hi Prateek,
>>>>>
>>>>>> Even with the two patches, I still observe the following lock
>>>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>>>
>>>>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>>>         - 10.73% _raw_spin_lock
>>>>>>            - 9.57% __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>            - 0.82% task_rq_lock
>>>>>>                 newidle_balance
>>>>>>                 pick_next_task_fair
>>>>>>                 __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>
>>>>>> Since David mentioned rq->avg_idle check is probably not the right step
>>>>>> towards the solution, this experiment introduces a per-shard
>>>>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>>>>>> notifies of the possibility of one or more rq covered in the shard's
>>>>>> domain having a queued task. shard's overload flag is set at the same
>>>>>> time as "rq->rd->overload", and is cleared when shard's list is found
>>>>>> to be empty.
>>>>>
>>>>> I think this is an interesting idea, but I feel that it's still working
>>>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>>>> conservation.
>>>>
>>>> I don't think so! Work conservation is possible if there is an
>>>> imbalance. Consider the case where we 15 tasks in the shared_runq but we
>>>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
>>>
>>> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
>>> in the shared runq if they were being run. They would be dequeued from
>>> the shared_runq in __dequeue_entity(), which would be called from
>>> set_next_entity() before they were run. In this case, the
>>> shard->overload check should be equivalent to the
>>> !list_empty(&shard->list) check.
>>>
>>> Oh, or is the idea that we're not bothering to pull them from the
>>> shared_runq if they're being woken up and enqueued on an idle core that
>>> will immediately run them on the next resched path? If so, I wonder if
>>> we would instead just want to not enqueue the task in the shared_runq at
>>> all? Consider that if another task comes in on an rq with
>>> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
>>> were being woken up on idle cores (nor take the overhead of inserting
>>> and then immediately removing them from the shared_runq).
> 
> Friendly ping on this point. This is the only scenario where I could see
> the overload check helping, so I want to make sure I'm understanding it
> and am correct in that just avoiding enqueueing the task in the shard in
> this scenario would give us the same benefit.

Whoops! Missed answering this. So the original motivation for
'shard->overload' was that there is rq lock contention, very likely as
a result of shared_runq_pick_next_task() trying to grab a remote rq's
lock. Looking at shared_runq_pick_next_task(), the criterion
"!task_on_cpu(src_rq, p)" led me to believe we might end up enqueuing a
task that is running on the CPU, but now that I take a look at
shared_runq_enqueue_task() being called from __enqueue_entity(), this
should be a very rare scenario.

However, even if a running task were often enqueued into a shared_runq,
a cleared 'shard->overload' flag would indicate that none of the
runqueues covered by the shard is overloaded, and hence peeking into the
shard could be skipped. Let me see if I can grab some more stats to
verify what exactly is happening.

> 
>> So this is the breakdown of outcomes after peeking into the shared_runq
>> during newidle_balance:
>>
>>                                                 SHARED_RUNQ                     SHARED_RUNQ
>>                                         + correct cost accounting       + correct cost accounting
>>                                                                         + rq->avg_idle early bail
>>
>> tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)
>>
>> attempts                                :       6,560,413                  2,273,334           (-65.35%)
>> shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
>> successful at pulling task              :       2,557,158 [38/98%]           342,839 [15.08%]  (-86.59%)
>> unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)
>>
>> As you can see, there are more attempts and a greater chance of success
>> in the case without the rq->avg_idle check upfront. Where the problem
>> lies (at least what I believe is) a task is waiting to be enqueued / has
>> been enqueued while we are trying to migrate a task fetched from the
>> shared_runq. Thus, instead of just being idle for a short duration and
>> running the task, we are now making it wait till we fetch another task
>> onto the CPU.
>>
>> I think the scenario changes as follows with shared_runq:
>>
>> - Current
>>
>>
>>       [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
>> 	+-------+	+-------+                       +-------+	+-------+
>> 	|	|	|	|        wakeup         |	|	|	|
>> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
>> 	|	|	|	|       -------->       |	|	|	|
>> 	+-------+	+-------+                       +-------+	+-------+
>>
>> - With shared_runq
>>
>>       [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
>> 	+-------+	+-------+                       +-------+	+-------+
>> 	|	|	|	|        wakeup         |	|	|	|
>> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
>> 	|	|	|	|       -------->       |	|	|	|
>> 	+-------+	+-------+                       +-------+	+-------+
>>
>> We reach a similar final state but with shared_runq we've paid a price
>> for task migration. Worst case, the following timeline can happen:
>>
>>         |
>>   CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
>>         |
>>         |                  pull T1 \             pull T4 /
>>         |
>>   CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
>>         |            [T4 TTWU]
>>         |
>>
>> With the rq->avg_idle bailout, it might end up looking like:
>>
>>         |
>>   CPU0  | [          T0 R, T1 Q          ] [T1 R ...
>>         |
>>         |
>>   CPU1  | [T3 R] [ I ] [T4 R ...
>>         |            
>>         |
> 
> This certainly seems possible, and wouldn't be terribly surprising or
> unexpected. Taking a step back here, I want to be clear that I do
> understand the motivation for including the rq->avg_idle check for
> SHARED_RUNQ; even just conceptually, and regardless of the numbers you
> and others have observed for workloads that do these short sleeps. The
> whole idea behind that check is that we want to avoid doing
> newidle_balance() if the overhead of doing newidle_balance() would
> exceed the amount of time that a task was blocked. Makes sense. Why
> would you take the overhead of balancing if you have reason to believe
> that a task is likely to be idle for less time than it takes to do a
> migration?
> 
> There's certainly a reasonable argument for why that should also apply
> to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
> greater than the amount of time that an sd is expected to be idle, then
> it's not worth bothering with SHARED_RUNQ either. On the other hand, the
> claim of SHARED_RUNQ is that it's faster than doing a regular balance
> pass, because we're doing an O(# shards) iteration to find tasks (before
> sharding it was O(1)), rather than O(# CPUs). So if we also do the
> rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
> cache for a full load_balance() call.
> 
> Maybe that makes sense and is ultimately the correct design /
> implementation for the feature. I'm not fundamentally opposed to that,
> but I think we should be cognizant of the tradeoff we're making. If we
> don't include this rq->avg_idle check, then some workloads will regress
> because we're doing excessive migrations, but if we do check it, then
> others will also regress because we're doing insufficient migrations due
> to incorrectly assuming that an rq won't be idle for long. On yet
> another hand, maybe it's fine to allow users to work around that by
> setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
> because we ignore that and set rq->max_idle_balance_cost = curr_cost in
> newidle_balance() if we end up doing a balance pass. I also know that
> Peter and others discourage the use of these debugfs knobs, so I'm not
> sure it's even applicable to point that out as a workaround.
> 
> And so hopefully the problem starts to become clear. It doesn't take
> long for for us to get mired in heuristics that make it difficult to
> reason about the expected behavior of the feature, and also difficult to
> reason about future changes as these heuristics have now all crossed
> streams. Maybe that's OK, and is preferable to the alternative. My
> personal opinion, however, is that it's preferable to provide users with
> knobs that do straightforward things that are independent from existing
> heuristics and knobs which were added for other circumstances. I'd
> rather have confidence that I understand how a feature is supposed to
> work, and can easily reason about when it's stupid (or not) to use it,
> vs. have an expectation for it to not regress workloads in any scenario.
> 
> Note that this doesn't mean we can't make my patches less dumb. I think
> your suggestions to e.g. check the overload flag (or possibly even
> better to just not enqueue in a shard if the rq isn't overloaded),
> re-check ttwu->pending after failing to find a task in the shard, etc
> make complete sense. There's no downside -- we're just avoiding
> pointless work. It's the heuristics like checking rq->avg_idle that
> really worry me.

I agree since avg_idle is merely a prediction that may or may not be
true.

> 
> Peter -- I think it would be helpful if you could weigh in here just to
> provide your thoughts on this more "philosophical" question.
> 
>> If possible, can you check how long is the avg_idle running your
>> workload? Meanwhile, I believe there are a few workloads that
>> exhibit same behavior as tbench (large scale idling for short
>> duration) Let me go check if I can see tbench like issue there.
> 
> Sure thing, in the meantime I'll test this out on HHVM. I've actually
> been working on getting a build + testbed ready for a few days, so
> hopefully it won't take much longer to get some results. Even if it
> turns out that this works great for HHVM, I'd ideally like to get
> Peter's and others' thoughts on the above.

I'll gather some more data too in the meantime :)

> 
> Thanks,
> David
 
--
Thanks and Regards,
Prateek