[RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Posted by Chen Yu 2 years, 3 months ago
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by another wakee, so p has to
find a different idle CPU.

Inspired by Mathieu's idea[1], consider the sleep time of the task.
If the task is a short sleeper, keep p's previous CPU idle
for a short while. Later, when p is woken up, it can choose its
previous CPU in select_idle_sibling(). While p's previous CPU is reserved,
other wakees are not allowed to choose this CPU in select_idle_cpu().
The reservation period is set to the task's average sleep time. That
is to say, if p is a short sleeping task, there is no need to migrate
p to another idle CPU.

This does not break the work conservation of the scheduler,
because a wakee will still try its best to find an idle CPU.
The difference is that different idle CPUs now have different
priorities. On the other hand, in theory this extra check could
increase the failure ratio of select_idle_cpu(), but per the
initial test results, no regression is detected.
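
The per-task sleep statistics come from patch 1/2 (not shown in this
mail), which presumably records the dequeue time in p->se.prev_sleep_time
and folds the observed sleep length into p->se.sleep_avg at wakeup. A
minimal sketch of that bookkeeping, only to illustrate the idea (field
names match the hunks below; see patch 1/2 for the exact code):

static void update_task_sleep_avg(struct task_struct *p, u64 now)
{
	u64 last_sleep = p->se.prev_sleep_time;

	/* Fold the just-finished sleep into the 1/8-weight EWMA. */
	if (last_sleep && now > last_sleep)
		update_avg(&p->se.sleep_avg, now - last_sleep);
}

As with any sched_feat, SIS_CACHE can be toggled at run time for
comparison, e.g. "echo NO_SIS_CACHE > /sys/kernel/debug/sched/features".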

Baseline: tip/sched/core, on top of:
Commit 3f4feb58037a ("sched: Misc cleanups")

Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
cpufreq governor is performance, turbo boost is disabled, C-states deeper
than C1 are disabled, Numa balancing is disabled.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)

Noticeable netperf improvements are observed in the 168- and 224-thread
cases.

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)

schbench(old)
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)

Not much difference is observed in hackbench/schbench, given the
run-to-run variance.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 	now = sched_clock_cpu(cpu_of(rq));
 	p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+	/*
+	 * If this rq will become idle, and dequeued task is
+	 * a short sleeping one, check if we can reserve
+	 * this idle CPU for that task for a short while.
+	 * During this reservation period, other wakees will
+	 * skip this 'idle' CPU in select_idle_cpu(), and this
+	 * short sleeping task can pick its previous CPU in
+	 * select_idle_sibling(), which brings better cache
+	 * locality.
+	 */
+	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+		rq->cache_hot_timeout = now + p->se.sleep_avg;
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
-	    sched_cpu_cookie_match(cpu_rq(cpu), p))
+	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
+		if (sched_feat(SIS_CACHE) &&
+		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
+			return -1;
+
 		return cpu;
+	}
 
 	return -1;
 }
@@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	int cpu;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (!available_idle_cpu(cpu)) {
+		bool cache_hot = sched_feat(SIS_CACHE) ?
+			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
+
+		if (!available_idle_cpu(cpu) || cache_hot) {
 			idle = false;
 			if (*idle_cpu == -1) {
-				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
+				    !cache_hot) {
 					*idle_cpu = cpu;
 					break;
 				}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62013c49c451..7a2c12c3b6d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,6 +1078,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			cache_hot_timeout;
 
 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
-- 
2.25.1
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by K Prateek Nayak 2 years, 3 months ago
Hello Chenyu,

One question ...

On 9/11/2023 8:20 AM, Chen Yu wrote:
> [..snip..]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> [..more snip..]
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>  	int cpu;
>  
>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>  			idle = false;
>  			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {

Here, the CPU is running a SCHED_IDLE task ...

>  					*idle_cpu = cpu;
>  					break;
>  				}

... but just below this, there are following lines to cache the idle_cpu:

		}
		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;

Would it make sense to also add the same "cache_hot" check here when we
come across an idle CPU during the search for an idle core? Something
like:

-		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;

Implications with the above change:

If the entire core is idle, "select_idle_core()" will return the core
and the search will bail out in "select_idle_cpu()". Otherwise, the
cache-hot idle CPUs encountered during the search for idle core will be
ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
not cache-hot.

Thoughts?

> [..snip..]

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Prateek,

On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> One question ...
> 
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> > [..snip..]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > [..more snip..]
> > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> >  	int cpu;
> >  
> >  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> > -		if (!available_idle_cpu(cpu)) {
> > +		bool cache_hot = sched_feat(SIS_CACHE) ?
> > +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> > +
> > +		if (!available_idle_cpu(cpu) || cache_hot) {
> >  			idle = false;
> >  			if (*idle_cpu == -1) {
> > -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> > +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> > +				    !cache_hot) {
> 
> Here, the CPU is running a SCHED_IDLE task ...
> 
> >  					*idle_cpu = cpu;
> >  					break;
> >  				}
> 
> ... but just below this, there are following lines to cache the idle_cpu:
> 
> 		}
> 		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> 			*idle_cpu = cpu;
> 
> Would it make sense to also add the same "cache_hot" check here when we
> come across an idle CPU during the search for an idle core? Something
> like:
> 
> -		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))

When we reach the above code, the following condition should be true:
 (available_idle_cpu(cpu) && !cache_hot)
because the previous 'if' statement was false. So I guess we already
have !cache_hot ?

> +		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
> 			*idle_cpu = cpu;
> 
> Implications with the above change:
> 
> If the entire core is idle, "select_idle_core()" will return the core
> and the search will bail out in "select_idle_cpu()". Otherwise, the
> cache-hot idle CPUs encountered during the search for idle core will be
> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
> not cache-hot.
> 
> Thoughts?
> 

Yes, agreed, in your case we want to skip the cache-hot idle CPU if the
entire core is not idle.

thanks,
Chenyu
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by K Prateek Nayak 2 years, 3 months ago
Hello Chenyu,

On 9/14/2023 4:13 PM, Chen Yu wrote:
> Hi Prateek,
> 
> On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> One question ...
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> [..more snip..]
>>> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>>>  	int cpu;
>>>  
>>>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
>>> -		if (!available_idle_cpu(cpu)) {
>>> +		bool cache_hot = sched_feat(SIS_CACHE) ?
>>> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
>>> +
>>> +		if (!available_idle_cpu(cpu) || cache_hot) {
>>>  			idle = false;
>>>  			if (*idle_cpu == -1) {
>>> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
>>> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>> +				    !cache_hot) {
>>
>> Here, the CPU is running a SCHED_IDLE task ...
>>
>>>  					*idle_cpu = cpu;
>>>  					break;
>>>  				}
>>
>> ... but just below this, there are following lines to cache the idle_cpu:
>>
>> 		}
>> 		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>> 			*idle_cpu = cpu;
>>
>> Would it make sense to also add the same "cache_hot" check here when we
>> come across an idle CPU during the search for an idle core? Something
>> like:
>>
>> -		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> 
> When we reached above code, the following condition should be true:
>  (available_idle_cpu(cpu) && !cache_hot)
> because the previous 'if' statement is false. So I guess we already
> has !cache_hot ?

Ah! You are right. I missed the break at end of the if block. Thank you
for pointing it out to me :)

> 
>> +		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
>> 			*idle_cpu = cpu;
>>
>> Implications with the above change:
>>
>> If the entire core is idle, "select_idle_core()" will return the core
>> and the search will bail out in "select_idle_cpu()". Otherwise, the
>> cache-hot idle CPUs encountered during the search for idle core will be
>> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
>> not cache-hot.
>>
>> Thoughts?
>>
> 
> Yes, agree, we want to skip the cache-hot idle CPU if that entire core is not idle
> in your case.
> 
> thanks,
> Chenyu
 
--
Thanks and Regards,
Prateek
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by K Prateek Nayak 2 years, 3 months ago
Hello Chenyu,

On 9/11/2023 8:20 AM, Chen Yu wrote:
>  [..snip..]
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/features.h |  1 +
>  kernel/sched/sched.h    |  1 +
>  3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	hrtick_update(rq);
>  	now = sched_clock_cpu(cpu_of(rq));
>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> +#endif
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>  {
>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;

Just wondering,

Similar to how select_idle_core() caches the "idle_cpu" if it ends up
finding one in its search for an idle core, would returning a "cache-hot
idle CPU" be better than returning previous CPU / current CPU if all
idle CPUs found during the search in select_idle_cpu() are marked
cache-hot?

Speaking of cache-hot idle CPU, is netperf actually more happy with
piling on current CPU? I ask this because the logic seems to be
reserving the previous CPU for a task that dislikes migration but I
do not see anything in the wake_affine_idle() path that would make the
short sleeper proactively choose the previous CPU when the wakeup is
marked with the WF_SYNC flag. Let me know if I'm missing something?
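
For reference, the decision in wake_affine_idle() is roughly the
following (paraphrased from mainline around this time, comments mine,
so treat it as a sketch rather than an exact copy):

static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/* Waker's CPU idle and cache-affine: prefer an idle prev_cpu. */
	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	/* Sync wakeup and the waker is the only runnable task: pile on. */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	return nr_cpumask_bits;
}

i.e. a sync wakeup with only the waker running targets the waker's CPU
at this stage.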

To confirm this can you look at the trend in migration count with and
without the series? Also the ratio of cache-hot idle CPUs to number
of CPUs searched can help estimate overheads of additional search - I
presume SIS_UTIL is efficient at curbing the additional search in
a busy system.

In the meantime, I'll queue this series for testing on my end too.

> +
>  		return cpu;
> +	}
>  
>  	return -1;
>  }
> [..snip..]


--
Thanks and Regards,
Prateek
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mike Galbraith 2 years, 3 months ago
On Mon, 2023-09-11 at 13:59 +0530, K Prateek Nayak wrote:
>
> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Some tests would be happier, others not at all, some numbers below.

I doubt much in the real world can perform better stacked: to be a win,
the service latency and utilization loss induced by stacked-task overlap
have to be less than the cache population cost of an idle CPU, something
modern CPUs have become darn good at, making for a high bar.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?

If select_idle_sibling() didn't intervene, the wake affine logic would
indeed routinely step all over working sets, and at one time briefly
did so due to a silly bug. (see kernel/sched/fair.c.today:7292)

The sync hint stems from the bad old days of SMP when cross-cpu latency
was horrid, and has lost much of its value, but its bias toward the
waker CPU still helps reduce man-in-the-middle latency in a busy box,
which can do even more damage than that done by stacking of not really
synchronous tasks that can be seen below.

The TCP/UDP_RR tests are very close to synchronous, and the numbers
reflect that: stacking is unbeatable for them [1]. But for the other
tests, which hopefully do something a bit more realistic than tiny-ball
ping-pong, stacking is a demonstrable loser.

Not super carefully run script output:

homer:/root # netperf.sh
TCP_SENDFILE-1  unbound    Avg:  87889  Sum:    87889
TCP_SENDFILE-1  stacked    Avg:  62885  Sum:    62885
TCP_SENDFILE-1  cross-smt  Avg:  58887  Sum:    58887
TCP_SENDFILE-1  cross-core Avg:  90673  Sum:    90673

TCP_STREAM-1    unbound    Avg:  71858  Sum:    71858
TCP_STREAM-1    stacked    Avg:  58883  Sum:    58883
TCP_STREAM-1    cross-smt  Avg:  49345  Sum:    49345
TCP_STREAM-1    cross-core Avg:  72346  Sum:    72346

TCP_MAERTS-1    unbound    Avg:  73890  Sum:    73890
TCP_MAERTS-1    stacked    Avg:  60682  Sum:    60682
TCP_MAERTS-1    cross-smt  Avg:  49868  Sum:    49868
TCP_MAERTS-1    cross-core Avg:  73343  Sum:    73343

UDP_STREAM-1    unbound    Avg:  99442  Sum:    99442
UDP_STREAM-1    stacked    Avg:  85319  Sum:    85319
UDP_STREAM-1    cross-smt  Avg:  63239  Sum:    63239
UDP_STREAM-1    cross-core Avg:  99102  Sum:    99102

TCP_RR-1        unbound    Avg: 200833  Sum:   200833
TCP_RR-1        stacked    Avg: 243733  Sum:   243733
TCP_RR-1        cross-smt  Avg: 138507  Sum:   138507
TCP_RR-1        cross-core Avg: 210404  Sum:   210404

UDP_RR-1        unbound    Avg: 252575  Sum:   252575
UDP_RR-1        stacked    Avg: 273081  Sum:   273081
UDP_RR-1        cross-smt  Avg: 168448  Sum:   168448
UDP_RR-1        cross-core Avg: 264124  Sum:   264124

1. nearly unbeatable - shared-L2 CPUs can beat it by a wee bit.
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Prateek,

thanks for your review,

On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> >  [..snip..]
> >  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  kernel/sched/sched.h    |  1 +
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	hrtick_update(rq);
> >  	now = sched_clock_cpu(cpu_of(rq));
> >  	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> > +#endif
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >  {
> >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > +		if (sched_feat(SIS_CACHE) &&
> > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > +			return -1;
> 
> Just wondering,
> 
> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> finding one in its search for an idle core, would returning a "cache-hot
> idle CPU" be better than returning previous CPU / current CPU if all
> idle CPUs found during the search in select_idle_cpu() are marked
> cache-hot?
> 

This is a good point; we can optimize this further. Currently I only
sent a simple version to demonstrate how we can leverage the task's
sleep time.

> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
put the waker and wakee together.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?
> 

If I understand correctly, WF_SYNC is to let the wakee be woken up
on the waker's CPU rather than the wakee's previous CPU, because
the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
the wakee's previous CPU. We only restrict other wakees from occupying
the previous CPU; we do not increase the chance that
wake_affine_idle() chooses the previous CPU.

Say there are two tasks t1 and t2, and t1's previous CPU is p1.
We don't change whether wake_affine_idle() will choose p1 when t1 is
woken up, but we make sure t2 will not choose p1.

> To confirm this can you look at the trend in migration count with and
> without the series? Also the ratio of cache-hot idle CPUs to number
> of CPUs searched can help estimate overheads of additional search - I
> presume SIS_UTIL is efficient at curbing the additional search in
> a busy system.

OK, I'll collect these statistics.

> 
> In the meantime, I'll queue this series for testing on my end too.

Thanks again for your time.


thanks,
Chenyu
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mike Galbraith 2 years, 3 months ago
On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
>
> > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.

Hm, seems there's at least one shared L2 case where that's untrue by
more than a tiny margin, which surprised me rather a lot.

For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
seem kinda odd, they're also seemingly repeatable (ergo showing them).
I measured a very modest cross-core win on a shared L2 Intel CPU some
years ago (when Q6600 was shiny/new) but nothing close to these deltas.

Makes me wonder what (a tad beefier) Bulldog RR numbers look like.

root@rpi4:~# ONLY=TCP_RR netperf.sh
TCP_RR-1        unbound    Avg:  29611  Sum:    29611
TCP_RR-1        stacked    Avg:  22540  Sum:    22540
TCP_RR-1        cross-core Avg:  30181  Sum:    30181

root@rpi4:~# netperf.sh
TCP_SENDFILE-1  unbound    Avg:  15572  Sum:    15572
TCP_SENDFILE-1  stacked    Avg:  11533  Sum:    11533
TCP_SENDFILE-1  cross-core Avg:  15751  Sum:    15751

TCP_STREAM-1    unbound    Avg:   6331  Sum:     6331
TCP_STREAM-1    stacked    Avg:   6031  Sum:     6031
TCP_STREAM-1    cross-core Avg:   6211  Sum:     6211

TCP_MAERTS-1    unbound    Avg:   6306  Sum:     6306
TCP_MAERTS-1    stacked    Avg:   6094  Sum:     6094
TCP_MAERTS-1    cross-core Avg:   9393  Sum:     9393

UDP_STREAM-1    unbound    Avg:  22277  Sum:    22277
UDP_STREAM-1    stacked    Avg:  18844  Sum:    18844
UDP_STREAM-1    cross-core Avg:  24749  Sum:    24749

TCP_RR-1        unbound    Avg:  29674  Sum:    29674
TCP_RR-1        stacked    Avg:  22267  Sum:    22267
TCP_RR-1        cross-core Avg:  30237  Sum:    30237

UDP_RR-1        unbound    Avg:  36189  Sum:    36189
UDP_RR-1        stacked    Avg:  27129  Sum:    27129
UDP_RR-1        cross-core Avg:  37033  Sum:    37033
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Mike,

thanks for taking a look,

On 2023-09-12 at 11:39:55 +0200, Mike Galbraith wrote:
> On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
> >
> > > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > > piling on current CPU?
> >
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
> 
> Hm, seems there's at least one shared L2 case where that's untrue by
> more than a tiny margin, which surprised me rather a lot.
> 

Yes, task stacking is in theory against the work conservation of the
scheduler; whether it wins depends on how strong the resource locality
(L1/L2 cache, DSB) is, and that is workload and hardware specific.

> For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
> seem kinda odd, they're also seemingly repeatable (ergo showing them).
> I measured a very modest cross-core win on a shared L2 Intel CPU some
> years ago (when Q6600 was shiny/new) but nothing close to these deltas.
> 

This is interesting. I have a Jacobsville, which also has a shared L2;
I'll run some tests to check the difference between stacking and
spreading tasks on that platform. But I guess that is another topic,
because the current patch avoids stacking tasks.

thanks,
Chenyu

> Makes me wonder what (a tad beefier) Bulldog RR numbers look like.
> 
> root@rpi4:~# ONLY=TCP_RR netperf.sh
> TCP_RR-1        unbound    Avg:  29611  Sum:    29611
> TCP_RR-1        stacked    Avg:  22540  Sum:    22540
> TCP_RR-1        cross-core Avg:  30181  Sum:    30181
> 
> root@rpi4:~# netperf.sh
> TCP_SENDFILE-1  unbound    Avg:  15572  Sum:    15572
> TCP_SENDFILE-1  stacked    Avg:  11533  Sum:    11533
> TCP_SENDFILE-1  cross-core Avg:  15751  Sum:    15751
> 
> TCP_STREAM-1    unbound    Avg:   6331  Sum:     6331
> TCP_STREAM-1    stacked    Avg:   6031  Sum:     6031
> TCP_STREAM-1    cross-core Avg:   6211  Sum:     6211
> 
> TCP_MAERTS-1    unbound    Avg:   6306  Sum:     6306
> TCP_MAERTS-1    stacked    Avg:   6094  Sum:     6094
> TCP_MAERTS-1    cross-core Avg:   9393  Sum:     9393
> 
> UDP_STREAM-1    unbound    Avg:  22277  Sum:    22277
> UDP_STREAM-1    stacked    Avg:  18844  Sum:    18844
> UDP_STREAM-1    cross-core Avg:  24749  Sum:    24749
> 
> TCP_RR-1        unbound    Avg:  29674  Sum:    29674
> TCP_RR-1        stacked    Avg:  22267  Sum:    22267
> TCP_RR-1        cross-core Avg:  30237  Sum:    30237
> 
> UDP_RR-1        unbound    Avg:  36189  Sum:    36189
> UDP_RR-1        stacked    Avg:  27129  Sum:    27129
> UDP_RR-1        cross-core Avg:  37033  Sum:    37033
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by K Prateek Nayak 2 years, 3 months ago
Hello Chenyu,

On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
> 
> thanks for your review,
> 
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>>  [..snip..]
>>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>>  kernel/sched/features.h |  1 +
>>>  kernel/sched/sched.h    |  1 +
>>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  	hrtick_update(rq);
>>>  	now = sched_clock_cpu(cpu_of(rq));
>>>  	p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> +	/*
>>> +	 * If this rq will become idle, and dequeued task is
>>> +	 * a short sleeping one, check if we can reserve
>>> +	 * this idle CPU for that task for a short while.
>>> +	 * During this reservation period, other wakees will
>>> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
>>> +	 * short sleeping task can pick its previous CPU in
>>> +	 * select_idle_sibling(), which brings better cache
>>> +	 * locality.
>>> +	 */
>>> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>>  }
>>>  
>>>  #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>>  {
>>>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> +		if (sched_feat(SIS_CACHE) &&
>>> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> +			return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
> 
> This is a good point, we can optimize this further. Currently I only
> send a simple version to desmonstrate how we can leverage the task's
> sleep time.
> 
>> Speaking of cache-hot idle CPU, is netperf actually more happy with
>> piling on current CPU?
> 
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
> 
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
> 
> If I understand correctly, WF_SYNC is to let the wakee to woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> wakee's previous CPU. We can only restrict that other wakee does not
> occupy the previous CPU, but do not enhance the possibility that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up with the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and restrict any wakeup onto it until the "cache_hot_timeout" is
crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.

Should the timeout be cleared if the wakeup decides not to target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected, with a lot of idle
targets overlooked since the CPUs stay marked cache-hot for a longer
duration.)
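
Just to make the clearing idea concrete, a purely hypothetical sketch
(the helper and its call site are made up and not part of the posted
patches; the real change, if any, could look quite different):

/* Hypothetical: drop the reservation once the wakeup has picked a
 * CPU other than the reserved previous CPU. */
static inline void clear_cache_hot(int prev_cpu, int target_cpu)
{
	struct rq *rq = cpu_rq(prev_cpu);

	if (target_cpu != prev_cpu && rq->cache_hot_timeout)
		rq->cache_hot_timeout = 0;
}

called from select_idle_sibling() once the final target is known.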

Let me know what you think.

> 
> Say, there are two tasks t1, t2. t1's previous CPU is p1.
> We don't enhance that when t1 is woken up, wake_affine_idle() will
> choose p1 or not, but we makes sure t2 will not choose p1.
> 
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
> 
> OK, I'll collect these statistics.

Thank you :)

> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Prateek,

On 2023-09-12 at 08:35:12 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/11/2023 3:49 PM, Chen Yu wrote:
> > Hi Prateek,
> > 
> > thanks for your review,
> > 
> > On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/11/2023 8:20 AM, Chen Yu wrote:
> >>>  [..snip..]
> >>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >>>  kernel/sched/features.h |  1 +
> >>>  kernel/sched/sched.h    |  1 +
> >>>  3 files changed, 29 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e20f50726ab8..fe3b760c9654 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  	hrtick_update(rq);
> >>>  	now = sched_clock_cpu(cpu_of(rq));
> >>>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> >>> +#ifdef CONFIG_SMP
> >>> +	/*
> >>> +	 * If this rq will become idle, and dequeued task is
> >>> +	 * a short sleeping one, check if we can reserve
> >>> +	 * this idle CPU for that task for a short while.
> >>> +	 * During this reservation period, other wakees will
> >>> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> >>> +	 * short sleeping task can pick its previous CPU in
> >>> +	 * select_idle_sibling(), which brings better cache
> >>> +	 * locality.
> >>> +	 */
> >>> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> >>> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> >>> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> >>> +#endif
> >>>  }
> >>>  
> >>>  #ifdef CONFIG_SMP
> >>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >>>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >>>  {
> >>>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> >>> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> >>> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> >>> +		if (sched_feat(SIS_CACHE) &&
> >>> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> >>> +			return -1;
> >>
> >> Just wondering,
> >>
> >> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> >> finding one in its search for an idle core, would returning a "cache-hot
> >> idle CPU" be better than returning previous CPU / current CPU if all
> >> idle CPUs found during the search in select_idle_cpu() are marked
> >> cache-hot?
> >>
> > 
> > This is a good point, we can optimize this further. Currently I only
> > send a simple version to desmonstrate how we can leverage the task's
> > sleep time.
> > 
> >> Speaking of cache-hot idle CPU, is netperf actually more happy with
> >> piling on current CPU?
> > 
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
> > 
> >> I ask this because the logic seems to be
> >> reserving the previous CPU for a task that dislikes migration but I
> >> do not see anything in the wake_affine_idle() path that would make the
> >> short sleeper proactively choose the previous CPU when the wakeup is
> >> marked with the WF_SYNC flag. Let me know if I'm missing something?
> >>
> > 
> > If I understand correctly, WF_SYNC is to let the wakee to woken up
> > on the waker's CPU, rather than the wakee's previous CPU, because
> > the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> > wakee's previous CPU. We can only restrict that other wakee does not
> > occupy the previous CPU, but do not enhance the possibility that
> > wake_affine_idle() chooses the previous CPU.
> 
> Correct me if I'm wrong here,
> 
> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
> and restrict any wakeup happening until the "cache_hot_timeout" is
> crossed. Let us assume a perfect world where the task wakes up before
> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> this while for the short sleeper but since the wakeup bears WF_SYNC
> flag, the whole reservation is ignored and waker's LLC is explored.
> 

Ah, I see your point. Do you mean that, because the waker has WF_SYNC set,
wake_affine_idle() forces the short sleeping wakee to be woken up on the
waker's CPU rather than the wakee's previous CPU, so the wakee's previous
CPU has been marked as cache-hot for nothing?

> Should the timeout be cleared if the wakeup decides to not target the
> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> small enough to curb any side effect that could possibly show here but
> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> a larger value, the wakeup path might be affected where lot of idle
> targets are overlooked since the CPUs are marked cache-hot forr longer
> duration)
> 
> Let me know what you think.
> 

This makes sense. In theory the above logic can be added to
select_idle_sibling(): if the target CPU is chosen rather than
the previous CPU, the previous CPU's cache-hot flag should be
cleared.

But this might bring overhead, because we would need to grab the rq
lock and write to another CPU's rq, which could be costly. It
seems to be a trade-off of the current implementation. On the other
hand, if the user sets sysctl_sched_migration_cost to a quite
large value:
1. Without SIS_CACHE, there is no task migration.
2. With SIS_CACHE enabled, all idle CPUs are cache-hot and are skipped
   in select_idle_cpu(), so the wakee is woken up locally.
The effect seems to be the same, so there should not be much impact
on the wakeup behavior, I suppose.

thanks,
Chenyu
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by K Prateek Nayak 2 years, 3 months ago
Hello Chenyu,

On 9/12/2023 6:02 PM, Chen Yu wrote:
> [..snip..]
>
>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>> wakee's previous CPU. We can only restrict that other wakee does not
>>> occupy the previous CPU, but do not enhance the possibility that
>>> wake_affine_idle() chooses the previous CPU.
>>
>> Correct me if I'm wrong here,
>>
>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
>> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
>> and restrict any wakeup happening until the "cache_hot_timeout" is
>> crossed. Let us assume a perfect world where the task wakes up before
>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>> this while for the short sleeper but since the wakeup bears WF_SYNC
>> flag, the whole reservation is ignored and waker's LLC is explored.
>>
> 
> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> forces the short sleeping wakee to be woken up on waker's CPU rather the
> wakee's previous CPU, but wakee's previous has been marked as cache hot
> for nothing?

Precisely :)

> 
>> Should the timeout be cleared if the wakeup decides to not target the
>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>> small enough to curb any side effect that could possibly show here but
>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>> a larger value, the wakeup path might be affected where lot of idle
>> targets are overlooked since the CPUs are marked cache-hot forr longer
>> duration)
>>
>> Let me know what you think.
>>
> 
> This makes sense. In theory the above logic can be added in
> select_idle_sibling(), if target CPU is chosen rather than
> the previous CPU, the previous CPU's cache hot flag should be
> cleared.
> 
> But this might bring overhead. Because we need to grab the rq
> lock and write to other CPU's rq, which could be costly. It
> seems to be a trade-off of current implementation.

I agree, it will not be pretty. Maybe the other way is to keep a
history of the type of wakeup the task experiences (similar to
wakee_flips, but for sync vs. non-sync wakeups) and only reserve
the CPU if the task is mostly woken up via non-sync wakeups? Thinking
out loud here.
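
A very rough sketch of the shape such a history could take; none of the
fields or thresholds below exist in the posted patches, they are made up
purely for illustration:

/* Illustrative only: decaying counts of sync vs. non-sync wakeups. */
static inline void record_wakeup_type(struct sched_entity *se, int sync)
{
	if (sync)
		se->sync_wakeups++;
	else
		se->nosync_wakeups++;

	/* Halve both counts periodically so recent behaviour dominates. */
	if (se->sync_wakeups + se->nosync_wakeups > 64) {
		se->sync_wakeups >>= 1;
		se->nosync_wakeups >>= 1;
	}
}

/* Arm the reservation only for tasks mostly woken without WF_SYNC. */
static inline bool mostly_nosync_wakeups(struct sched_entity *se)
{
	return se->nosync_wakeups > se->sync_wakeups;
}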

> On the other
> hand, if the user sets the sysctl_sched_migration_cost to a quite
> large value:
> 1. Without SIS_CACHE, there is no task migration.

But that is in the load-balancing path; I think the wakeup path will
still migrate the task. That said, I believe there are very few cases
where all CPUs are marked cache-hot and SIS_UTIL does not bail
out straight away as a result of high utilization. Probably a rare
scenario.

> 2. With SIS_CACHE enabled, all idle CPUs are cache hot and be skipped
>    in select_idle_cpu(), the wakee will be woken up locally.
> It seems to be of the same effect, so there is no much impact
> to wakeup behavior I suppose.
> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mathieu Desnoyers 2 years, 3 months ago
On 9/10/23 22:50, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
> 
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
> 
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
> 
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
> 
> netperf
> =======
> case                    load            baseline(std%)  compare%( std%)
> UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> 
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
> 
> hackbench
> =========
> case                    load            baseline(std%)  compare%( std%)
> process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> 
> schbench(old)
> ========
> case                    load            baseline(std%)  compare%( std%)
> normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> 
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
> 
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>   kernel/sched/features.h |  1 +
>   kernel/sched/sched.h    |  1 +
>   3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	hrtick_update(rq);
>   	now = sched_clock_cpu(cpu_of(rq));
>   	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;

This is really cool!

There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have short blocked periods.
Those bursts will likely bring the average to values too high
to stay below sysctl_sched_migration_cost.

I wonder if changing the code above for the following would help ?

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);

For tasks that have a large sleep_avg, it would activate this rq
"appear as not idle for rq selection" scheme for a window of
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.

Thanks,

Mathieu


> +#endif
>   }
>   
>   #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>   static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>   {
>   	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;
> +
>   		return cpu;
> +	}
>   
>   	return -1;
>   }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>   	int cpu;
>   
>   	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>   			idle = false;
>   			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {
>   					*idle_cpu = cpu;
>   					break;
>   				}
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>    */
>   SCHED_FEAT(SIS_PROP, false)
>   SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>   
>   /*
>    * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
>   #endif
>   	u64			idle_stamp;
>   	u64			avg_idle;
> +	u64			cache_hot_timeout;
>   
>   	unsigned long		wake_stamp;
>   	u64			wake_avg_idle;

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 2 months ago
Hi Mathieu,

On 2023-09-11 at 11:26:50 -0400, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> > 
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> > 
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> > 
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> > 
> > netperf
> > =======
> > case                    load            baseline(std%)  compare%( std%)
> > UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> > UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> > UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> > UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> > 
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> > 
> > hackbench
> > =========
> > case                    load            baseline(std%)  compare%( std%)
> > process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> > process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> > process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> > process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> > process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> > process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> > threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> > threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> > threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> > threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> > threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> > threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> > 
> > schbench(old)
> > ========
> > case                    load            baseline(std%)  compare%( std%)
> > normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> > normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> > normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> > normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> > 
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> > 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >   kernel/sched/features.h |  1 +
> >   kernel/sched/sched.h    |  1 +
> >   3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   	hrtick_update(rq);
> >   	now = sched_clock_cpu(cpu_of(rq));
> >   	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> This is really cool!
> 
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
> 
> I wonder if changing the code above for the following would help ?
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
> 	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);
> 
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.
>

After revisiting this, I think maybe the above approach is better, because
that method considers all the historical data rather than only the short
sleep durations:

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Say a task sleeps for 1 ms once, and in all the subsequent context switches
it sleeps much longer than 1 ms; then burst_sleep_avg will remain
1 ms, which does not reflect the behavior of this task.

update_avg() is actually an EWMA (Exponentially Weighted Moving Average):

new_avg = 0.875 * old_avg + 0.125 * latest_sleep_duration

which approximately reflects the latest 1/0.125 = 8 samples. I guess
8 is small enough to capture the burst sleep duration.
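For reference, update_avg() in kernel/sched/fair.c is roughly the following
shift-based EWMA (reproduced from memory, so treat the exact form as an
approximation):

static void update_avg(u64 *avg, u64 sample)
{
	/* Move the average 1/8 of the way toward the new sample. */
	s64 diff = sample - *avg;

	*avg += diff / 8;
}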

Unless we want to capture the scenario where, during the latest 8 sleep
events, there is 1 short sleep and 7 long sleeps, and we still want to
honor that 1 short sleep and reserve the idle CPU. Then we could either
use the min() logic you proposed, or use the logic you proposed in another
reply, enhanced by clearing the burst average sleep duration after 8
consecutive long sleep events, which looks a little like overkill.
Anyway, I can get some data based on this.
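Purely as an illustration, that "clear after consecutive long sleeps" idea
could look roughly like the sketch below on the enqueue path; the
long_sleep_count field and the threshold of 8 are made up for this example
and are not part of the patch:

	if (now - last_sleep < sysctl_sched_migration_cost) {
		/* Short sleep: fold it into the burst average. */
		update_avg(&p->se.burst_sleep_avg, now - last_sleep);
		p->se.long_sleep_count = 0;
	} else if (++p->se.long_sleep_count >= 8) {
		/* 8 long sleeps in a row: the old burst average is stale. */
		p->se.burst_sleep_avg = 0;
	}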

thanks,
Chenyu
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mathieu Desnoyers 2 years, 3 months ago
On 9/11/23 11:26, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
[...]
>> ---
>>   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>   kernel/sched/features.h |  1 +
>>   kernel/sched/sched.h    |  1 +
>>   3 files changed, 29 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e20f50726ab8..fe3b760c9654 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, 
>> struct task_struct *p, int flags)
>>       hrtick_update(rq);
>>       now = sched_clock_cpu(cpu_of(rq));
>>       p->se.prev_sleep_time = task_sleep ? now : 0;
>> +#ifdef CONFIG_SMP
>> +    /*
>> +     * If this rq will become idle, and dequeued task is
>> +     * a short sleeping one, check if we can reserve
>> +     * this idle CPU for that task for a short while.
>> +     * During this reservation period, other wakees will
>> +     * skip this 'idle' CPU in select_idle_cpu(), and this
>> +     * short sleeping task can pick its previous CPU in
>> +     * select_idle_sibling(), which brings better cache
>> +     * locality.
>> +     */
>> +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>> +        p->se.sleep_avg && p->se.sleep_avg < 
>> sysctl_sched_migration_cost)
>> +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> This is really cool!
> 
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
> 
> I wonder if changing the code above for the following would help ?
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && 
> p->se.sleep_avg)
>      rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, 
> p->se.sleep_avg);
> 
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.

There is perhaps a better way to handle bursts:

When calculating the sleep_avg, we actually only really care about
the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
to select which of the sleeps should be taken into account in the avg.

We could rename the field "sleep_avg" to "burst_sleep_avg", and have:

u64 now = sched_clock_cpu(task_cpu(p));

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Then we can have this code in dequeue_task_fair():

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Mathieu,

thanks for the review,

On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
> On 9/11/23 11:26, Mathieu Desnoyers wrote:
> > On 9/10/23 22:50, Chen Yu wrote:
> [...]
> > > ---
> > >   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> > >   kernel/sched/features.h |  1 +
> > >   kernel/sched/sched.h    |  1 +
> > >   3 files changed, 29 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e20f50726ab8..fe3b760c9654 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
> > > struct task_struct *p, int flags)
> > >       hrtick_update(rq);
> > >       now = sched_clock_cpu(cpu_of(rq));
> > >       p->se.prev_sleep_time = task_sleep ? now : 0;
> > > +#ifdef CONFIG_SMP
> > > +    /*
> > > +     * If this rq will become idle, and dequeued task is
> > > +     * a short sleeping one, check if we can reserve
> > > +     * this idle CPU for that task for a short while.
> > > +     * During this reservation period, other wakees will
> > > +     * skip this 'idle' CPU in select_idle_cpu(), and this
> > > +     * short sleeping task can pick its previous CPU in
> > > +     * select_idle_sibling(), which brings better cache
> > > +     * locality.
> > > +     */
> > > +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > +        p->se.sleep_avg && p->se.sleep_avg <
> > > sysctl_sched_migration_cost)
> > > +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> > 
> > This is really cool!
> > 
> > There is one scenario that worries me with this approach: workloads
> > that sleep for a long time and then have short blocked periods.
> > Those bursts will likely bring the average to values too high
> > to stay below sysctl_sched_migration_cost.
> > 
> > I wonder if changing the code above for the following would help ?
> > 
> > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > p->se.sleep_avg)
> >      rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> > p->se.sleep_avg);
> > 
> > For tasks that have a large sleep_avg, it would activate this rq
> > "appear as not idle for rq selection" scheme for a window of
> > sysctl_sched_migration_cost. If the sleep ends up being a long one,
> > preventing other tasks from being migrated to this rq for a tiny
> > window should not matter performance-wise. I would expect that it
> > could help workloads that come in bursts.
> 
> There is perhaps a better way to handle bursts:
> 
> When calculating the sleep_avg, we actually only really care about
> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
> to select which of the sleeps should be taken into account in the avg.
> 
> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
> 
> u64 now = sched_clock_cpu(task_cpu(p));
> 
> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
>     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
> 
> Then we can have this code is dequeue_task_fair:
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg)
> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> 
> Thoughts ?
> 

This looks reasonable; it is more fine-grained to only monitor the short sleep durations
of that task. Let me try this approach to see if there is any difference.

thanks,
Chenyu

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mathieu Desnoyers 2 years, 3 months ago
On 9/12/23 07:53, Chen Yu wrote:
> Hi Mathieu,
> 
> thanks for the review,
> 
> On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
>> On 9/11/23 11:26, Mathieu Desnoyers wrote:
>>> On 9/10/23 22:50, Chen Yu wrote:
>> [...]
>>>> ---
>>>>    kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>>>    kernel/sched/features.h |  1 +
>>>>    kernel/sched/sched.h    |  1 +
>>>>    3 files changed, 29 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e20f50726ab8..fe3b760c9654 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
>>>> struct task_struct *p, int flags)
>>>>        hrtick_update(rq);
>>>>        now = sched_clock_cpu(cpu_of(rq));
>>>>        p->se.prev_sleep_time = task_sleep ? now : 0;
>>>> +#ifdef CONFIG_SMP
>>>> +    /*
>>>> +     * If this rq will become idle, and dequeued task is
>>>> +     * a short sleeping one, check if we can reserve
>>>> +     * this idle CPU for that task for a short while.
>>>> +     * During this reservation period, other wakees will
>>>> +     * skip this 'idle' CPU in select_idle_cpu(), and this
>>>> +     * short sleeping task can pick its previous CPU in
>>>> +     * select_idle_sibling(), which brings better cache
>>>> +     * locality.
>>>> +     */
>>>> +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>>> +        p->se.sleep_avg && p->se.sleep_avg <
>>>> sysctl_sched_migration_cost)
>>>> +        rq->cache_hot_timeout = now + p->se.sleep_avg;
>>>
>>> This is really cool!
>>>
>>> There is one scenario that worries me with this approach: workloads
>>> that sleep for a long time and then have short blocked periods.
>>> Those bursts will likely bring the average to values too high
>>> to stay below sysctl_sched_migration_cost.
>>>
>>> I wonder if changing the code above for the following would help ?
>>>
>>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> p->se.sleep_avg)
>>>       rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
>>> p->se.sleep_avg);
>>>
>>> For tasks that have a large sleep_avg, it would activate this rq
>>> "appear as not idle for rq selection" scheme for a window of
>>> sysctl_sched_migration_cost. If the sleep ends up being a long one,
>>> preventing other tasks from being migrated to this rq for a tiny
>>> window should not matter performance-wise. I would expect that it
>>> could help workloads that come in bursts.
>>
>> There is perhaps a better way to handle bursts:
>>
>> When calculating the sleep_avg, we actually only really care about
>> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
>> to select which of the sleeps should be taken into account in the avg.
>>
>> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
>>
>> u64 now = sched_clock_cpu(task_cpu(p));
>>
>> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
>>      now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
>> 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
>>
>> Then we can have this code is dequeue_task_fair:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg)
>> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
>> Thoughts ?
>>
> 
> This looks reasonable, it would be fine grained to only monitor the short sleep duration
> of that task. Let me try this approach to see if there is any difference.
> 

One more tweak: given that more than one task can update the cache_hot_timeout forward
one after another, and given that some tasks have larger burst_sleep_avg values than
others, I suspect we want to keep the forward movement monotonic with something like:

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
     rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thanks,

Mathieu


> thanks,
> Chenyu
> 
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
> On 9/12/23 07:53, Chen Yu wrote:
> > Hi Mathieu,
> > 
> > thanks for the review,
> > 
> > On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
> > > On 9/11/23 11:26, Mathieu Desnoyers wrote:
> > > > On 9/10/23 22:50, Chen Yu wrote:
> > > [...]
> > > > > ---
> > > > >    kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> > > > >    kernel/sched/features.h |  1 +
> > > > >    kernel/sched/sched.h    |  1 +
> > > > >    3 files changed, 29 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index e20f50726ab8..fe3b760c9654 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
> > > > > struct task_struct *p, int flags)
> > > > >        hrtick_update(rq);
> > > > >        now = sched_clock_cpu(cpu_of(rq));
> > > > >        p->se.prev_sleep_time = task_sleep ? now : 0;
> > > > > +#ifdef CONFIG_SMP
> > > > > +    /*
> > > > > +     * If this rq will become idle, and dequeued task is
> > > > > +     * a short sleeping one, check if we can reserve
> > > > > +     * this idle CPU for that task for a short while.
> > > > > +     * During this reservation period, other wakees will
> > > > > +     * skip this 'idle' CPU in select_idle_cpu(), and this
> > > > > +     * short sleeping task can pick its previous CPU in
> > > > > +     * select_idle_sibling(), which brings better cache
> > > > > +     * locality.
> > > > > +     */
> > > > > +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > > > +        p->se.sleep_avg && p->se.sleep_avg <
> > > > > sysctl_sched_migration_cost)
> > > > > +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> > > > 
> > > > This is really cool!
> > > > 
> > > > There is one scenario that worries me with this approach: workloads
> > > > that sleep for a long time and then have short blocked periods.
> > > > Those bursts will likely bring the average to values too high
> > > > to stay below sysctl_sched_migration_cost.
> > > > 
> > > > I wonder if changing the code above for the following would help ?
> > > > 
> > > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > > p->se.sleep_avg)
> > > >       rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> > > > p->se.sleep_avg);
> > > > 
> > > > For tasks that have a large sleep_avg, it would activate this rq
> > > > "appear as not idle for rq selection" scheme for a window of
> > > > sysctl_sched_migration_cost. If the sleep ends up being a long one,
> > > > preventing other tasks from being migrated to this rq for a tiny
> > > > window should not matter performance-wise. I would expect that it
> > > > could help workloads that come in bursts.
> > > 
> > > There is perhaps a better way to handle bursts:
> > > 
> > > When calculating the sleep_avg, we actually only really care about
> > > the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
> > > to select which of the sleeps should be taken into account in the avg.
> > > 
> > > We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
> > > 
> > > u64 now = sched_clock_cpu(task_cpu(p));
> > > 
> > > if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
> > >      now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> > > 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
> > > 
> > > Then we can have this code is dequeue_task_fair:
> > > 
> > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg)
> > > 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> > > 
> > > Thoughts ?
> > > 
> > 
> > This looks reasonable, it would be fine grained to only monitor the short sleep duration
> > of that task. Let me try this approach to see if there is any difference.
> > 
> 
> One more tweak: given that more than one task can update the cache_hot_timeout forward
> one after another, and given that some tasks have larger burst_sleep_avg values than
> others, I suspect we want to keep the forward movement monotonic with something like:
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>     rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>

Yeah, Aaron has mentioned this too:
https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
May I know the benefit of keeping the forward movement monotonic?
My thought was that we should only honor the latest dequeued task's burst_sleep_avg.
Because we don't know whether the older dequeued task's cache has been scribbled by the latest
dequeued task or not, does it still make sense to wake up the older dequeued task on its
previous CPU?

thanks,
Chenyu
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Mathieu Desnoyers 2 years, 3 months ago
On 9/12/23 10:14, Chen Yu wrote:
> On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
[...]
>>
>> One more tweak: given that more than one task can update the cache_hot_timeout forward
>> one after another, and given that some tasks have larger burst_sleep_avg values than
>> others, I suspect we want to keep the forward movement monotonic with something like:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>>      rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
>> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
> 
> Yeah, Aaron has mentioned this too:
> https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
> May I know the benefit of keeping forward movement monotonic?
> I thought that, should we only honor the latest dequeued task's burst_sleep_avg?
> Because we don't know whether the old deuqued task's cache has been scribbled by the latest
> dequeued task or not, does it still make sense to wake up the old dequeued task on its
> previous CPU?

Here is my reasoning:

If a second task is scheduled after the first dequeued task (a
task with large burst_sleep_avg) is dequeued, that second task (with
small burst_sleep_avg) would need to entirely scribble the other task's
cache lines within the time given by sysctl_sched_migration_cost, which
I suspect is typically not very large. So I doubt that the second task
can entirely kick out the first task's cache lines within that time frame,
and therefore that second task should not move the cache_hot_timeout
value backwards.

But perhaps I'm missing something ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Aaron Lu 2 years, 3 months ago
On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
> 
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
> 
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
> 
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
> 
> netperf
> =======
> case                    load            baseline(std%)  compare%( std%)
> UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> 
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
> 
> hackbench
> =========
> case                    load            baseline(std%)  compare%( std%)
> process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> 
> schbench(old)
> ========
> case                    load            baseline(std%)  compare%( std%)
> normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> 
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
> 
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/features.h |  1 +
>  kernel/sched/sched.h    |  1 +
>  3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	hrtick_update(rq);
>  	now = sched_clock_cpu(cpu_of(rq));
>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;

Should this be written as:
		rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
?

An even earlier task may have a bigger sleep_avg, and overwriting
rq->cache_hot_timeout here with an earlier time may cause that task to
migrate. Not sure how much impact this can have; it's just something that
crossed my mind while reading this code.

> +#endif
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>  {
>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;
> +

Maybe introduce a new function that also considers rq->cache_hot_timeout,
like available_idle_cpu_migrate(), so that the logic above and below can be
simplified a bit?
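For illustration, such a helper could look something like the sketch below;
the name and exact shape are only a guess, not part of the patch:

static inline int available_idle_cpu_migrate(int cpu)
{
	if (!available_idle_cpu(cpu))
		return 0;

	/* Treat a CPU inside its SIS_CACHE reservation window as busy. */
	if (sched_feat(SIS_CACHE) &&
	    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
		return 0;

	return 1;
}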

I was thinking of simply adding that rq->cache_hot_timeout check to
available_idle_cpu(), but then a long-sleeping task could be forced to
migrate if its prev_cpu happens to have just descheduled a task that set the
rq's cache_hot_timeout. I guess that's why you chose to only change the idle
semantics in select_idle_cpu() but not in select_idle_sibling()?

Thanks,
Aaron

>  		return cpu;
> +	}
>  
>  	return -1;
>  }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>  	int cpu;
>  
>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>  			idle = false;
>  			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {
>  					*idle_cpu = cpu;
>  					break;
>  				}
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_PROP, false)
>  SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>  
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
>  #endif
>  	u64			idle_stamp;
>  	u64			avg_idle;
> +	u64			cache_hot_timeout;
>  
>  	unsigned long		wake_stamp;
>  	u64			wake_avg_idle;
> -- 
> 2.25.1
>
Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
Posted by Chen Yu 2 years, 3 months ago
Hi Aaron,

thanks for the review,

On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> > 
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> > 
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> > 
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> > 
> > netperf
> > =======
> > case                    load            baseline(std%)  compare%( std%)
> > UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> > UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> > UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> > UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> > 
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> > 
> > hackbench
> > =========
> > case                    load            baseline(std%)  compare%( std%)
> > process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> > process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> > process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> > process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> > process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> > process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> > threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> > threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> > threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> > threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> > threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> > threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> > 
> > schbench(old)
> > ========
> > case                    load            baseline(std%)  compare%( std%)
> > normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> > normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> > normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> > normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> > 
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> > 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  kernel/sched/sched.h    |  1 +
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	hrtick_update(rq);
> >  	now = sched_clock_cpu(cpu_of(rq));
> >  	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> Should this be written as:
> 		rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
> ?
> 
> A even earlier task may have a bigger sleep_avg and overwriting
> rq->cache_hot_timeout here with an earlier time may cause that task
> migrate. Not sure how much impact this can have, just someting hit my
> brain while reading this code.
>

My previous thought was that SIS_CACHE should only honor the average sleep time
of the latest dequeued task. Say task p1 goes to sleep, then task p2 is woken
up and later goes to sleep on the same CPU; we only honor the sleep time of p2,
because we don't know how long p2 has been running, how much of the L1/L2 cache
p2 has polluted, and whether it still makes sense to let p1 be woken up on its
previous CPU (is this CPU still cache hot for p1?).
 
> > +#endif
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >  {
> >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > +		if (sched_feat(SIS_CACHE) &&
> > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > +			return -1;
> > +
> 
> Maybe introduce a new function that also considers rq->cache_hot_timeout,
> like available_idle_cpu_migrate() so that above and below logic can be
> simplified a bit?
> 

Yes, that would be simpler; I'll do it in the next version.

> I was thinking to simply add that rq->cache_hot_timeout check to
> available_idle_cpu() but then a long sleeping task could be forced to
> migrate if its prev_cpu happens to just deschedule a task that sets rq's
> cache_hot_timeout. I guess that's why you chose to only change the idle
> semantic in select_idle_cpu() but not in select_idle_sibling()?
>

Yes, sort of. The reason I did not put this cache-hot check in available_idle_cpu()
or idle_cpu() is mainly that these APIs are generic and can be invoked by select_idle_sibling().
If the task falls asleep and is woken up quickly, its previous idle CPU would also be skipped,
so no one could use this CPU within the cache-hot period, including the cache-hot task
itself.
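For context, the prev-CPU fast path in select_idle_sibling() accepts the
previous CPU via available_idle_cpu() directly (abridged below from fair.c
around that time, so details may differ slightly), which is why adding the
cache-hot check inside available_idle_cpu()/idle_cpu() would also block the
returning task from its own previous CPU:

	/* If the previous CPU is cache affine and idle, don't be stupid: */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev))
		return prev;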

thanks,
Chenyu