When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually preferred
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.
Inspired by Mathieu's idea[1], consider the sleep time of the task.
If that task is a short sleeping one, keep p's previous CPU idle
for a short while. Later when p is woken up, it can choose its
previous CPU in select_idle_sibling(). While p's previous CPU is
reserved, other wakees are not allowed to choose this CPU in
select_idle_cpu(). The reservation period is set to the task's average
sleep time. That is to say, if p is a short sleeping task, there is
no need to migrate p to another idle CPU.
This does not break the work conservation of the scheduler,
because the wakee will still try its best to find an idle CPU.
The difference is that different idle CPUs might have different
priorities. On the other hand, in theory this extra check could
increase the failure ratio of select_idle_cpu(), but per the
initial test results, no regression is detected.
Baseline: tip/sched/core, on top of:
Commit 3f4feb58037a ("sched: Misc cleanups")
Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
cpufreq governor is performance, turbo boost is disabled, C-states deeper
than C1 are disabled, Numa balancing is disabled.
netperf
=======
case load baseline(std%) compare%( std%)
UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04)
UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42)
UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96)
UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50)
Noticeable improvements in netperf are observed in the 168 and 224
thread cases.
hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48)
process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10)
process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41)
process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95)
process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97)
process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09)
threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49)
threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90)
threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44)
threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43)
threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91)
threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07)
schbench(old)
=============
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37)
normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83)
normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28)
normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28)
Not much difference is observed in hackbench/schbench, due to
run-to-run variance.
Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
now = sched_clock_cpu(cpu_of(rq));
p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+ /*
+ * If this rq will become idle, and dequeued task is
+ * a short sleeping one, check if we can reserve
+ * this idle CPU for that task for a short while.
+ * During this reservation period, other wakees will
+ * skip this 'idle' CPU in select_idle_cpu(), and this
+ * short sleeping task can pick its previous CPU in
+ * select_idle_sibling(), which brings better cache
+ * locality.
+ */
+ if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+ p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+ rq->cache_hot_timeout = now + p->se.sleep_avg;
+#endif
}
#ifdef CONFIG_SMP
@@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
- sched_cpu_cookie_match(cpu_rq(cpu), p))
+ sched_cpu_cookie_match(cpu_rq(cpu), p)) {
+ if (sched_feat(SIS_CACHE) &&
+ sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
+ return -1;
+
return cpu;
+ }
return -1;
}
@@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
int cpu;
for_each_cpu(cpu, cpu_smt_mask(core)) {
- if (!available_idle_cpu(cpu)) {
+ bool cache_hot = sched_feat(SIS_CACHE) ?
+ sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
+
+ if (!available_idle_cpu(cpu) || cache_hot) {
idle = false;
if (*idle_cpu == -1) {
- if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
+ !cache_hot) {
*idle_cpu = cpu;
break;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62013c49c451..7a2c12c3b6d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,6 +1078,7 @@ struct rq {
#endif
u64 idle_stamp;
u64 avg_idle;
+ u64 cache_hot_timeout;
unsigned long wake_stamp;
u64 wake_avg_idle;
--
2.25.1
Hello Chenyu,
One question ...
On 9/11/2023 8:20 AM, Chen Yu wrote:
> [..snip..]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> [..more snip..]
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> int cpu;
>
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> - if (!available_idle_cpu(cpu)) {
> + bool cache_hot = sched_feat(SIS_CACHE) ?
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> + if (!available_idle_cpu(cpu) || cache_hot) {
> idle = false;
> if (*idle_cpu == -1) {
> - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + !cache_hot) {
Here, the CPU is running a SCHED_IDLE task ...
> *idle_cpu = cpu;
> break;
> }
... but just below this, there are the following lines to cache the idle_cpu:
}
if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
*idle_cpu = cpu;
Would it make sense to also add the same "cache_hot" check here when we
come across an idle CPU during the search for an idle core? Something
like:
- if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+ if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
*idle_cpu = cpu;
Implications with the above change:
If the entire core is idle, "select_idle_core()" will return the core
and the search will bail out in "select_idle_cpu()". Otherwise, the
cache-hot idle CPUs encountered during the search for an idle core will
be ignored, and if "idle_cpu" is not -1, it contains an idle CPU that is
not cache-hot.
Thoughts?
> [..snip..]
--
Thanks and Regards,
Prateek
Hi Prateek,
On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> One question ...
>
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> > [..snip..]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > [..more snip..]
> > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> > int cpu;
> >
> > for_each_cpu(cpu, cpu_smt_mask(core)) {
> > - if (!available_idle_cpu(cpu)) {
> > + bool cache_hot = sched_feat(SIS_CACHE) ?
> > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> > +
> > + if (!available_idle_cpu(cpu) || cache_hot) {
> > idle = false;
> > if (*idle_cpu == -1) {
> > - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> > + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> > + !cache_hot) {
>
> Here, the CPU is running a SCHED_IDLE task ...
>
> > *idle_cpu = cpu;
> > break;
> > }
>
> ... but just below this, there are following lines to cache the idle_cpu:
>
> }
> if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> *idle_cpu = cpu;
>
> Would it make sense to also add the same "cache_hot" check here when we
> come across an idle CPU during the search for an idle core? Something
> like:
>
> - if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
When we reach the above code, the following condition should be true:
(available_idle_cpu(cpu) && !cache_hot)
because the previous 'if' statement was false. So I guess we already
have !cache_hot?
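To make it concrete, piecing the hunk above together with the context
lines you quoted, the patched loop would look roughly like this
(simplified sketch, not the verbatim patch):

	for_each_cpu(cpu, cpu_smt_mask(core)) {
		bool cache_hot = sched_feat(SIS_CACHE) ?
			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;

		if (!available_idle_cpu(cpu) || cache_hot) {
			idle = false;
			if (*idle_cpu == -1) {
				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
				    !cache_hot) {
					*idle_cpu = cpu;
					break;
				}
				break;	/* note: unconditional break here */
			}
		}
		/*
		 * If *idle_cpu is still -1 at this point, the 'if' above was
		 * false, i.e. available_idle_cpu(cpu) && !cache_hot holds.
		 */
		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;
	}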
> + if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
> *idle_cpu = cpu;
>
> Implications with the above change:
>
> If the entire core is idle, "select_idle_core()" will return the core
> and the search will bail out in "select_idle_cpu()". Otherwise, the
> cache-hot idle CPUs encountered during the search for idle core will be
> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
> not cache-hot.
>
> Thoughts?
>
Yes, agreed. In your case we want to skip the cache-hot idle CPU if that
entire core is not idle.
thanks,
Chenyu
Hello Chenyu,
On 9/14/2023 4:13 PM, Chen Yu wrote:
> Hi Prateek,
>
> On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> One question ...
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> [..more snip..]
>>> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>>> int cpu;
>>>
>>> for_each_cpu(cpu, cpu_smt_mask(core)) {
>>> - if (!available_idle_cpu(cpu)) {
>>> + bool cache_hot = sched_feat(SIS_CACHE) ?
>>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
>>> +
>>> + if (!available_idle_cpu(cpu) || cache_hot) {
>>> idle = false;
>>> if (*idle_cpu == -1) {
>>> - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
>>> + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>> + !cache_hot) {
>>
>> Here, the CPU is running a SCHED_IDLE task ...
>>
>>> *idle_cpu = cpu;
>>> break;
>>> }
>>
>> ... but just below this, there are following lines to cache the idle_cpu:
>>
>> }
>> if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>> *idle_cpu = cpu;
>>
>> Would it make sense to also add the same "cache_hot" check here when we
>> come across an idle CPU during the search for an idle core? Something
>> like:
>>
>> - if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>
> When we reached above code, the following condition should be true:
> (available_idle_cpu(cpu) && !cache_hot)
> because the previous 'if' statement is false. So I guess we already
> has !cache_hot ?
Ah! You are right. I missed the break at the end of the if block. Thank you
for pointing it out to me :)
>
>> + if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
>> *idle_cpu = cpu;
>>
>> Implications with the above change:
>>
>> If the entire core is idle, "select_idle_core()" will return the core
>> and the search will bail out in "select_idle_cpu()". Otherwise, the
>> cache-hot idle CPUs encountered during the search for idle core will be
>> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
>> not cache-hot.
>>
>> Thoughts?
>>
>
> Yes, agree, we want to skip the cache-hot idle CPU if that entire core is not idle
> in your case.
>
> thanks,
> Chenyu
--
Thanks and Regards,
Prateek
Hello Chenyu,
On 9/11/2023 8:20 AM, Chen Yu wrote:
> [..snip..]
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 1 +
> 3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> now = sched_clock_cpu(cpu_of(rq));
> p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> + /*
> + * If this rq will become idle, and dequeued task is
> + * a short sleeping one, check if we can reserve
> + * this idle CPU for that task for a short while.
> + * During this reservation period, other wakees will
> + * skip this 'idle' CPU in select_idle_cpu(), and this
> + * short sleeping task can pick its previous CPU in
> + * select_idle_sibling(), which brings better cache
> + * locality.
> + */
> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> + rq->cache_hot_timeout = now + p->se.sleep_avg;
> +#endif
> }
>
> #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> {
> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> - sched_cpu_cookie_match(cpu_rq(cpu), p))
> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> + if (sched_feat(SIS_CACHE) &&
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> + return -1;
Just wondering,
Similar to how select_idle_core() caches the "idle_cpu" if it ends up
finding one in its search for an idle core, would returning a "cache-hot
idle CPU" be better than returning previous CPU / current CPU if all
idle CPUs found during the search in select_idle_cpu() are marked
cache-hot?
Speaking of cache-hot idle CPU, is netperf actually more happy with
piling on current CPU? I ask this because the logic seems to be
reserving the previous CPU for a task that dislikes migration but I
do not see anything in the wake_affine_idle() path that would make the
short sleeper proactively choose the previous CPU when the wakeup is
marked with the WF_SYNC flag. Let me know if I'm missing something?
To confirm this, can you look at the trend in migration count with and
without the series? Also, the ratio of cache-hot idle CPUs to the number
of CPUs searched can help estimate the overhead of the additional search -
I presume SIS_UTIL is efficient at curbing the additional search in
a busy system.
In the meantime, I'll queue this series for testing on my end too.
> +
> return cpu;
> + }
>
> return -1;
> }
> [..snip..]
--
Thanks and Regards,
Prateek
On Mon, 2023-09-11 at 13:59 +0530, K Prateek Nayak wrote:
>
> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Some tests would be happier, others not at all, some numbers below.

I doubt much in the real world can perform better stacked, to be a win,
stacked task overlap induced service latency and utilization loss has to
be less than cache population cost of an idle CPU, something that modern
CPUs have become darn good at, making for a high bar.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?

If select_idle_sibling() didn't intervene, the wake affine logic would
indeed routinely step all over working sets, and at one time briefly did
so due to a silly bug. (see kernel/sched/fair.c.today:7292)

The sync hint stems from the bad old days of SMP when cross-cpu latency
was horrid, and has lost much of its value, but its bias toward the waker
CPU still helps reduce man-in-the-middle latency in a busy box, which can
do even more damage than that done by stacking of not really synchronous
tasks that can be seen below.

The TCP/UDP_RR tests are very close to synchronous, and the numbers
reflect that, stacking is unbeatable for them [1], but for the other
tests, hopefully doing something a bit more realistic than tiny ball
ping-pong, stacking is a demonstrable loser.

Not super carefully run script output:

homer:/root # netperf.sh
TCP_SENDFILE-1  unbound     Avg:  87889  Sum:  87889
TCP_SENDFILE-1  stacked     Avg:  62885  Sum:  62885
TCP_SENDFILE-1  cross-smt   Avg:  58887  Sum:  58887
TCP_SENDFILE-1  cross-core  Avg:  90673  Sum:  90673

TCP_STREAM-1    unbound     Avg:  71858  Sum:  71858
TCP_STREAM-1    stacked     Avg:  58883  Sum:  58883
TCP_STREAM-1    cross-smt   Avg:  49345  Sum:  49345
TCP_STREAM-1    cross-core  Avg:  72346  Sum:  72346

TCP_MAERTS-1    unbound     Avg:  73890  Sum:  73890
TCP_MAERTS-1    stacked     Avg:  60682  Sum:  60682
TCP_MAERTS-1    cross-smt   Avg:  49868  Sum:  49868
TCP_MAERTS-1    cross-core  Avg:  73343  Sum:  73343

UDP_STREAM-1    unbound     Avg:  99442  Sum:  99442
UDP_STREAM-1    stacked     Avg:  85319  Sum:  85319
UDP_STREAM-1    cross-smt   Avg:  63239  Sum:  63239
UDP_STREAM-1    cross-core  Avg:  99102  Sum:  99102

TCP_RR-1        unbound     Avg: 200833  Sum: 200833
TCP_RR-1        stacked     Avg: 243733  Sum: 243733
TCP_RR-1        cross-smt   Avg: 138507  Sum: 138507
TCP_RR-1        cross-core  Avg: 210404  Sum: 210404

UDP_RR-1        unbound     Avg: 252575  Sum: 252575
UDP_RR-1        stacked     Avg: 273081  Sum: 273081
UDP_RR-1        cross-smt   Avg: 168448  Sum: 168448
UDP_RR-1        cross-core  Avg: 264124  Sum: 264124

1. nearly unbeatable - shared L2 CPUS can by a wee bit.
Hi Prateek,
thanks for your review,
On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> > [..snip..]
> > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > now = sched_clock_cpu(cpu_of(rq));
> > p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > + /*
> > + * If this rq will become idle, and dequeued task is
> > + * a short sleeping one, check if we can reserve
> > + * this idle CPU for that task for a short while.
> > + * During this reservation period, other wakees will
> > + * skip this 'idle' CPU in select_idle_cpu(), and this
> > + * short sleeping task can pick its previous CPU in
> > + * select_idle_sibling(), which brings better cache
> > + * locality.
> > + */
> > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > + rq->cache_hot_timeout = now + p->se.sleep_avg;
> > +#endif
> > }
> >
> > #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > {
> > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > - sched_cpu_cookie_match(cpu_rq(cpu), p))
> > + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > + if (sched_feat(SIS_CACHE) &&
> > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > + return -1;
>
> Just wondering,
>
> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> finding one in its search for an idle core, would returning a "cache-hot
> idle CPU" be better than returning previous CPU / current CPU if all
> idle CPUs found during the search in select_idle_cpu() are marked
> cache-hot?
>
This is a good point; we can optimize this further. Currently I only
sent a simple version to demonstrate how we can leverage the task's
sleep time.
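To make that direction concrete, a rough, hypothetical fragment (the
cache_hot_idle variable and its placement are made up here, and the
SIS_PROP/has_idle_core and core-cookie details are omitted) could
remember the first reserved-but-idle CPU seen during the scan in
select_idle_cpu() and fall back to it instead of giving up:

	int cache_hot_idle = -1;

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		i = __select_idle_cpu(cpu, p);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;

		/* -1 here means either busy or idle-but-cache-hot */
		if (cache_hot_idle == -1 && available_idle_cpu(cpu) &&
		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
			cache_hot_idle = cpu;
	}

	/* fall back to a reserved (cache-hot) idle CPU rather than give up */
	return cache_hot_idle;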
> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?
Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
put the waker and wakee together.
> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?
>
If I understand correctly, WF_SYNC is to let the wakee be woken up
on the waker's CPU, rather than the wakee's previous CPU, because
the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
the wakee's previous CPU: we only restrict other wakees from occupying
the previous CPU, but do not enhance the possibility that
wake_affine_idle() chooses the previous CPU.
Say there are two tasks t1 and t2, and t1's previous CPU is p1.
We do not influence whether wake_affine_idle() chooses p1 when t1 is
woken up, but we make sure t2 will not choose p1.
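For reference, the wake_affine_idle() path being discussed looks roughly
like this (paraphrased from kernel/sched/fair.c, comments mine, details
may differ slightly across versions):

	static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
	{
		/*
		 * If the waker's CPU is idle and shares cache with prev_cpu,
		 * prefer an idle prev_cpu, otherwise the waker's CPU.
		 */
		if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
			return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

		/* WF_SYNC: the waker is about to sleep, so its CPU is attractive */
		if (sync && cpu_rq(this_cpu)->nr_running == 1)
			return this_cpu;

		/* otherwise fall back to an idle prev_cpu */
		if (available_idle_cpu(prev_cpu))
			return prev_cpu;

		return nr_cpumask_bits;	/* no preference */
	}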
> To confirm this can you look at the trend in migration count with and
> without the series? Also the ratio of cache-hot idle CPUs to number
> of CPUs searched can help estimate overheads of additional search - I
> presume SIS_UTIL is efficient at curbing the additional search in
> a busy system.
OK, I'll collect these statistics.
>
> In the meantime, I'll queue this series for testing on my end too.
Thanks again for your time.
thanks,
Chenyu
On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
>
> > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.

Hm, seems there's at least one shared L2 case where that's untrue by
more than a tiny margin, which surprised me rather a lot.

For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
seem kinda odd, they're also seemingly repeatable (ergo showing them).
I measured a very modest cross-core win on a shared L2 Intel CPU some
years ago (when Q6600 was shiny/new) but nothing close to these deltas.

Makes me wonder what (a tad beefier) Bulldog RR numbers look like.

root@rpi4:~# ONLY=TCP_RR netperf.sh
TCP_RR-1        unbound     Avg: 29611  Sum: 29611
TCP_RR-1        stacked     Avg: 22540  Sum: 22540
TCP_RR-1        cross-core  Avg: 30181  Sum: 30181

root@rpi4:~# netperf.sh
TCP_SENDFILE-1  unbound     Avg: 15572  Sum: 15572
TCP_SENDFILE-1  stacked     Avg: 11533  Sum: 11533
TCP_SENDFILE-1  cross-core  Avg: 15751  Sum: 15751

TCP_STREAM-1    unbound     Avg:  6331  Sum:  6331
TCP_STREAM-1    stacked     Avg:  6031  Sum:  6031
TCP_STREAM-1    cross-core  Avg:  6211  Sum:  6211

TCP_MAERTS-1    unbound     Avg:  6306  Sum:  6306
TCP_MAERTS-1    stacked     Avg:  6094  Sum:  6094
TCP_MAERTS-1    cross-core  Avg:  9393  Sum:  9393

UDP_STREAM-1    unbound     Avg: 22277  Sum: 22277
UDP_STREAM-1    stacked     Avg: 18844  Sum: 18844
UDP_STREAM-1    cross-core  Avg: 24749  Sum: 24749

TCP_RR-1        unbound     Avg: 29674  Sum: 29674
TCP_RR-1        stacked     Avg: 22267  Sum: 22267
TCP_RR-1        cross-core  Avg: 30237  Sum: 30237

UDP_RR-1        unbound     Avg: 36189  Sum: 36189
UDP_RR-1        stacked     Avg: 27129  Sum: 27129
UDP_RR-1        cross-core  Avg: 37033  Sum: 37033
Hi Mike,

thanks for taking a look,

On 2023-09-12 at 11:39:55 +0200, Mike Galbraith wrote:
> On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
> >
> > > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > > piling on current CPU?
> >
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
>
> Hm, seems there's at least one shared L2 case where that's untrue by
> more than a tiny margin, which surprised me rather a lot.
>

Yes, task stacking is in theory against the work conservation of the
scheduler, and it depends on how strong the resource (L1/L2 cache, DSB)
locality is, and it is workload and hardware specific.

> For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
> seem kinda odd, they're also seemingly repeatable (ergo showing them).
> I measured a very modest cross-core win on a shared L2 Intel CPU some
> years ago (when Q6600 was shiny/new) but nothing close to these deltas.
>

This is interesting. I have a Jacobsville which also has shared L2;
I'll run some tests to check what the difference is between task
stacking and task spreading on that platform. But I guess that is
another topic, because the current patch avoids stacking tasks.

thanks,
Chenyu

> Makes me wonder what (a tad beefier) Bulldog RR numbers look like.
>
> root@rpi4:~# ONLY=TCP_RR netperf.sh
> TCP_RR-1        unbound     Avg: 29611  Sum: 29611
> TCP_RR-1        stacked     Avg: 22540  Sum: 22540
> TCP_RR-1        cross-core  Avg: 30181  Sum: 30181
>
> root@rpi4:~# netperf.sh
> TCP_SENDFILE-1  unbound     Avg: 15572  Sum: 15572
> TCP_SENDFILE-1  stacked     Avg: 11533  Sum: 11533
> TCP_SENDFILE-1  cross-core  Avg: 15751  Sum: 15751
>
> TCP_STREAM-1    unbound     Avg:  6331  Sum:  6331
> TCP_STREAM-1    stacked     Avg:  6031  Sum:  6031
> TCP_STREAM-1    cross-core  Avg:  6211  Sum:  6211
>
> TCP_MAERTS-1    unbound     Avg:  6306  Sum:  6306
> TCP_MAERTS-1    stacked     Avg:  6094  Sum:  6094
> TCP_MAERTS-1    cross-core  Avg:  9393  Sum:  9393
>
> UDP_STREAM-1    unbound     Avg: 22277  Sum: 22277
> UDP_STREAM-1    stacked     Avg: 18844  Sum: 18844
> UDP_STREAM-1    cross-core  Avg: 24749  Sum: 24749
>
> TCP_RR-1        unbound     Avg: 29674  Sum: 29674
> TCP_RR-1        stacked     Avg: 22267  Sum: 22267
> TCP_RR-1        cross-core  Avg: 30237  Sum: 30237
>
> UDP_RR-1        unbound     Avg: 36189  Sum: 36189
> UDP_RR-1        stacked     Avg: 27129  Sum: 27129
> UDP_RR-1        cross-core  Avg: 37033  Sum: 37033
Hello Chenyu,
On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
>
> thanks for your review,
>
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
>>> kernel/sched/features.h | 1 +
>>> kernel/sched/sched.h | 1 +
>>> 3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> hrtick_update(rq);
>>> now = sched_clock_cpu(cpu_of(rq));
>>> p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> + /*
>>> + * If this rq will become idle, and dequeued task is
>>> + * a short sleeping one, check if we can reserve
>>> + * this idle CPU for that task for a short while.
>>> + * During this reservation period, other wakees will
>>> + * skip this 'idle' CPU in select_idle_cpu(), and this
>>> + * short sleeping task can pick its previous CPU in
>>> + * select_idle_sibling(), which brings better cache
>>> + * locality.
>>> + */
>>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>> }
>>>
>>> #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>> {
>>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> - sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> + if (sched_feat(SIS_CACHE) &&
>>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> + return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
>
> This is a good point, we can optimize this further. Currently I only
> send a simple version to desmonstrate how we can leverage the task's
> sleep time.
>
>> Speaking of cache-hot idle CPU, is netperf actually more happy with
>> piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
>
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
>
> If I understand correctly, WF_SYNC is to let the wakee to woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> wakee's previous CPU. We can only restrict that other wakee does not
> occupy the previous CPU, but do not enhance the possibility that
> wake_affine_idle() chooses the previous CPU.
Correct me if I'm wrong here,
Say a short sleeper is always woken up using the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and restrict other wakeups there until the "cache_hot_timeout" is
crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.

Should the timeout be cleared if the wakeup decides to not target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected, where a lot of idle
targets are overlooked since the CPUs are marked cache-hot for a longer
duration.)
Let me know what you think.
>
> Say, there are two tasks t1, t2. t1's previous CPU is p1.
> We don't enhance that when t1 is woken up, wake_affine_idle() will
> choose p1 or not, but we makes sure t2 will not choose p1.
>
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
>
> OK, I'll collect these statistics.
Thank you :)
>
> [..snip..]
>
--
Thanks and Regards,
Prateek
Hi Prateek,
On 2023-09-12 at 08:35:12 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/11/2023 3:49 PM, Chen Yu wrote:
> > Hi Prateek,
> >
> > thanks for your review,
> >
> > On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/11/2023 8:20 AM, Chen Yu wrote:
> >>> [..snip..]
> >>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> >>> kernel/sched/features.h | 1 +
> >>> kernel/sched/sched.h | 1 +
> >>> 3 files changed, 29 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e20f50726ab8..fe3b760c9654 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>> hrtick_update(rq);
> >>> now = sched_clock_cpu(cpu_of(rq));
> >>> p->se.prev_sleep_time = task_sleep ? now : 0;
> >>> +#ifdef CONFIG_SMP
> >>> + /*
> >>> + * If this rq will become idle, and dequeued task is
> >>> + * a short sleeping one, check if we can reserve
> >>> + * this idle CPU for that task for a short while.
> >>> + * During this reservation period, other wakees will
> >>> + * skip this 'idle' CPU in select_idle_cpu(), and this
> >>> + * short sleeping task can pick its previous CPU in
> >>> + * select_idle_sibling(), which brings better cache
> >>> + * locality.
> >>> + */
> >>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> >>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> >>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
> >>> +#endif
> >>> }
> >>>
> >>> #ifdef CONFIG_SMP
> >>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >>> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >>> {
> >>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> >>> - sched_cpu_cookie_match(cpu_rq(cpu), p))
> >>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> >>> + if (sched_feat(SIS_CACHE) &&
> >>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> >>> + return -1;
> >>
> >> Just wondering,
> >>
> >> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> >> finding one in its search for an idle core, would returning a "cache-hot
> >> idle CPU" be better than returning previous CPU / current CPU if all
> >> idle CPUs found during the search in select_idle_cpu() are marked
> >> cache-hot?
> >>
> >
> > This is a good point, we can optimize this further. Currently I only
> > send a simple version to desmonstrate how we can leverage the task's
> > sleep time.
> >
> >> Speaking of cache-hot idle CPU, is netperf actually more happy with
> >> piling on current CPU?
> >
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
> >
> >> I ask this because the logic seems to be
> >> reserving the previous CPU for a task that dislikes migration but I
> >> do not see anything in the wake_affine_idle() path that would make the
> >> short sleeper proactively choose the previous CPU when the wakeup is
> >> marked with the WF_SYNC flag. Let me know if I'm missing something?
> >>
> >
> > If I understand correctly, WF_SYNC is to let the wakee to woken up
> > on the waker's CPU, rather than the wakee's previous CPU, because
> > the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> > wakee's previous CPU. We can only restrict that other wakee does not
> > occupy the previous CPU, but do not enhance the possibility that
> > wake_affine_idle() chooses the previous CPU.
>
> Correct me if I'm wrong here,
>
> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
> and restrict any wakeup happening until the "cache_hot_timeout" is
> crossed. Let us assume a perfect world where the task wakes up before
> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> this while for the short sleeper but since the wakeup bears WF_SYNC
> flag, the whole reservation is ignored and waker's LLC is explored.
>
Ah, I see your point. Do you mean that because the waker has WF_SYNC,
wake_affine_idle() forces the short sleeping wakee to be woken up on the
waker's CPU rather than the wakee's previous CPU, so the wakee's previous
CPU has been marked as cache-hot for nothing?
> Should the timeout be cleared if the wakeup decides to not target the
> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> small enough to curb any side effect that could possibly show here but
> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> a larger value, the wakeup path might be affected where lot of idle
> targets are overlooked since the CPUs are marked cache-hot forr longer
> duration)
>
> Let me know what you think.
>
This makes sense. In theory the above logic could be added to
select_idle_sibling(): if the target CPU is chosen rather than
the previous CPU, the previous CPU's cache-hot flag should be
cleared (a rough sketch below).

But this might bring overhead, because we need to grab the rq
lock and write to another CPU's rq, which could be costly. It
seems to be a trade-off of the current implementation. On the other
hand, if the user sets sysctl_sched_migration_cost to a quite
large value:
1. Without SIS_CACHE, there is no task migration.
2. With SIS_CACHE enabled, all idle CPUs are cache-hot and get skipped
   in select_idle_cpu(), so the wakee will be woken up locally.
It seems to have the same effect, so there is not much impact
on the wakeup behavior I suppose.
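Just to make that trade-off concrete, a hypothetical sketch of the
clearing (the helper name and call site are made up, not part of the
posted patch; shown here as a plain remote write, ignoring the locking
concern above):

	static inline void sis_cache_clear_reservation(int prev_cpu)
	{
		struct rq *prev_rq = cpu_rq(prev_cpu);

		/* drop the reservation so other wakees can pick prev_cpu again */
		if (sched_feat(SIS_CACHE) && prev_rq->cache_hot_timeout)
			prev_rq->cache_hot_timeout = 0;
	}

	/* e.g. in select_idle_sibling(), once a CPU other than prev is chosen */
	if (i != prev)
		sis_cache_clear_reservation(prev);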
thanks,
Chenyu
Hello Chenyu,

On 9/12/2023 6:02 PM, Chen Yu wrote:
> [..snip..]
>
>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>> wakee's previous CPU. We can only restrict that other wakee does not
>>> occupy the previous CPU, but do not enhance the possibility that
>>> wake_affine_idle() chooses the previous CPU.
>>
>> Correct me if I'm wrong here,
>>
>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
>> and restrict any wakeup happening until the "cache_hot_timeout" is
>> crossed. Let us assume a perfect world where the task wakes up before
>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>> this while for the short sleeper but since the wakeup bears WF_SYNC
>> flag, the whole reservation is ignored and waker's LLC is explored.
>>
>
> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> forces the short sleeping wakee to be woken up on waker's CPU rather the
> wakee's previous CPU, but wakee's previous has been marked as cache hot
> for nothing?

Precisely :)

>
>> Should the timeout be cleared if the wakeup decides to not target the
>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>> small enough to curb any side effect that could possibly show here but
>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>> a larger value, the wakeup path might be affected where lot of idle
>> targets are overlooked since the CPUs are marked cache-hot forr longer
>> duration)
>>
>> Let me know what you think.
>>
>
> This makes sense. In theory the above logic can be added in
> select_idle_sibling(), if target CPU is chosen rather than
> the previous CPU, the previous CPU's cache hot flag should be
> cleared.
>
> But this might bring overhead. Because we need to grab the rq
> lock and write to other CPU's rq, which could be costly. It
> seems to be a trade-off of current implementation.

I agree, it will not be pretty. Maybe the other way is to have a
history of the type of wakeup the task experiences (similar to
wakee_flips but for sync and non-sync wakeups) and only reserve the
CPU if the task wakes up more via non-sync wakeups? Thinking out loud
here.

> On the other
> hand, if the user sets the sysctl_sched_migration_cost to a quite
> large value:
> 1. Without SIS_CACHE, there is no task migration.

But that is in the load balancing path. I think the wakeup path will
still migrate the task. But I believe there might be very few cases
where all CPUs are marked cache-hot and the SIS_UTIL will not bail out
straight away as a result of high utilization. Probably a rare
scenario.

> 2. With SIS_CACHE enabled, all idle CPUs are cache hot and be skipped
> in select_idle_cpu(), the wakee will be woken up locally.
> It seems to be of the same effect, so there is no much impact
> to wakeup behavior I suppose.
>
> [..snip..]

--
Thanks and Regards,
Prateek
On 9/10/23 22:50, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
>
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
>
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
>
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04)
> UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42)
> UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96)
> UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50)
>
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
>
> hackbench
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48)
> process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10)
> process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41)
> process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95)
> process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97)
> process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09)
> threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49)
> threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90)
> threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44)
> threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43)
> threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91)
> threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07)
>
> schbench(old)
> ========
> case load baseline(std%) compare%( std%)
> normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37)
> normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83)
> normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28)
> normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28)
>
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
>
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 1 +
> 3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> now = sched_clock_cpu(cpu_of(rq));
> p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> + /*
> + * If this rq will become idle, and dequeued task is
> + * a short sleeping one, check if we can reserve
> + * this idle CPU for that task for a short while.
> + * During this reservation period, other wakees will
> + * skip this 'idle' CPU in select_idle_cpu(), and this
> + * short sleeping task can pick its previous CPU in
> + * select_idle_sibling(), which brings better cache
> + * locality.
> + */
> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> + rq->cache_hot_timeout = now + p->se.sleep_avg;
This is really cool!
There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have short blocked periods.
Those bursts will likely bring the average to values too high
to stay below sysctl_sched_migration_cost.
I wonder if changing the code above for the following would help ?
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);
For tasks that have a large sleep_avg, it would activate this rq
"appear as not idle for rq selection" scheme for a window of
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.
Thanks,
Mathieu
> +#endif
> }
>
> #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> {
> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> - sched_cpu_cookie_match(cpu_rq(cpu), p))
> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> + if (sched_feat(SIS_CACHE) &&
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> + return -1;
> +
> return cpu;
> + }
>
> return -1;
> }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> int cpu;
>
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> - if (!available_idle_cpu(cpu)) {
> + bool cache_hot = sched_feat(SIS_CACHE) ?
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> + if (!available_idle_cpu(cpu) || cache_hot) {
> idle = false;
> if (*idle_cpu == -1) {
> - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + !cache_hot) {
> *idle_cpu = cpu;
> break;
> }
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
> #endif
> u64 idle_stamp;
> u64 avg_idle;
> + u64 cache_hot_timeout;
>
> unsigned long wake_stamp;
> u64 wake_avg_idle;
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Hi Mathieu,
On 2023-09-11 at 11:26:50 -0400, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> >
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> >
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> >
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> >
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04)
> > UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42)
> > UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96)
> > UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50)
> >
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> >
> > hackbench
> > =========
> > case load baseline(std%) compare%( std%)
> > process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48)
> > process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10)
> > process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41)
> > process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95)
> > process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97)
> > process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09)
> > threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49)
> > threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90)
> > threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44)
> > threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43)
> > threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91)
> > threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07)
> >
> > schbench(old)
> > ========
> > case load baseline(std%) compare%( std%)
> > normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37)
> > normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83)
> > normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28)
> > normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28)
> >
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> >
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > now = sched_clock_cpu(cpu_of(rq));
> > p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > + /*
> > + * If this rq will become idle, and dequeued task is
> > + * a short sleeping one, check if we can reserve
> > + * this idle CPU for that task for a short while.
> > + * During this reservation period, other wakees will
> > + * skip this 'idle' CPU in select_idle_cpu(), and this
> > + * short sleeping task can pick its previous CPU in
> > + * select_idle_sibling(), which brings better cache
> > + * locality.
> > + */
> > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > + rq->cache_hot_timeout = now + p->se.sleep_avg;
>
> This is really cool!
>
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
>
> I wonder if changing the code above for the following would help ?
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
> rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);
>
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.
>
After a revisit, I think maybe the above approach is better, because
it considers all the historic data rather than only the short
sleep durations:
if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
update_avg(&p->se.burst_sleep_avg, now - last_sleep);
Say, if a task sleeps for 1 ms once, and in all the subsequent context
switches it sleeps much longer than 1 ms, then the burst_sleep_avg will
remain 1 ms, which does not reflect the behavior of this task.
update_avg() is actually an Exponentially Weighted Moving Average:
new_avg = 0.875 * old_avg + 0.125 * latest_sleep_duration
which approximately reflects the latest 1/0.125 = 8 samples. I guess
8 is small enough to capture the burst sleep duration.
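(For reference, update_avg() in kernel/sched is roughly the following,
which is where the 0.875/0.125 weighting above comes from:

	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		/* avg += (sample - avg) / 8, i.e. new = 7/8 * old + 1/8 * sample */
		*avg += diff / 8;
	}
)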
Unless we want to capture the scenario that, during the latest 8 sleep
events, there is 1 short sleep and 7 long sleeps, and
we still want to honor that 1 short sleep and reserve the idle CPU.
Then we can either use the min() logic you proposed here, or
use the logic you proposed in the other reply, but enhance it by clearing
the burst average sleep duration after 8 consecutive long sleep
events, which looks a little like overkill. Anyway, I can get some data based
on this.
thanks,
Chenyu
On 9/11/23 11:26, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
[...]
>> ---
>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
>> kernel/sched/features.h | 1 +
>> kernel/sched/sched.h | 1 +
>> 3 files changed, 29 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e20f50726ab8..fe3b760c9654 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
>> struct task_struct *p, int flags)
>> hrtick_update(rq);
>> now = sched_clock_cpu(cpu_of(rq));
>> p->se.prev_sleep_time = task_sleep ? now : 0;
>> +#ifdef CONFIG_SMP
>> + /*
>> + * If this rq will become idle, and dequeued task is
>> + * a short sleeping one, check if we can reserve
>> + * this idle CPU for that task for a short while.
>> + * During this reservation period, other wakees will
>> + * skip this 'idle' CPU in select_idle_cpu(), and this
>> + * short sleeping task can pick its previous CPU in
>> + * select_idle_sibling(), which brings better cache
>> + * locality.
>> + */
>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>> + p->se.sleep_avg && p->se.sleep_avg <
>> sysctl_sched_migration_cost)
>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
>
> This is really cool!
>
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
>
> I wonder if changing the code above for the following would help ?
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> p->se.sleep_avg)
> rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> p->se.sleep_avg);
>
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.
There is perhaps a better way to handle bursts:
When calculating the sleep_avg, we actually only really care about
the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
to select which of the sleeps should be taken into account in the avg.
We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
u64 now = sched_clock_cpu(task_cpu(p));
if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
update_avg(&p->se.burst_sleep_avg, now - last_sleep);
Then we can have this code in dequeue_task_fair():
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Hi Mathieu,

thanks for the review,

On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
> On 9/11/23 11:26, Mathieu Desnoyers wrote:
> > On 9/10/23 22:50, Chen Yu wrote:
> [...]
> > > ---
> > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> > > kernel/sched/features.h | 1 +
> > > kernel/sched/sched.h | 1 +
> > > 3 files changed, 29 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e20f50726ab8..fe3b760c9654 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
> > > struct task_struct *p, int flags)
> > > hrtick_update(rq);
> > > now = sched_clock_cpu(cpu_of(rq));
> > > p->se.prev_sleep_time = task_sleep ? now : 0;
> > > +#ifdef CONFIG_SMP
> > > + /*
> > > + * If this rq will become idle, and dequeued task is
> > > + * a short sleeping one, check if we can reserve
> > > + * this idle CPU for that task for a short while.
> > > + * During this reservation period, other wakees will
> > > + * skip this 'idle' CPU in select_idle_cpu(), and this
> > > + * short sleeping task can pick its previous CPU in
> > > + * select_idle_sibling(), which brings better cache
> > > + * locality.
> > > + */
> > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > + p->se.sleep_avg && p->se.sleep_avg <
> > > sysctl_sched_migration_cost)
> > > + rq->cache_hot_timeout = now + p->se.sleep_avg;
> >
> > This is really cool!
> >
> > There is one scenario that worries me with this approach: workloads
> > that sleep for a long time and then have short blocked periods.
> > Those bursts will likely bring the average to values too high
> > to stay below sysctl_sched_migration_cost.
> >
> > I wonder if changing the code above for the following would help ?
> >
> > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > p->se.sleep_avg)
> > rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> > p->se.sleep_avg);
> >
> > For tasks that have a large sleep_avg, it would activate this rq
> > "appear as not idle for rq selection" scheme for a window of
> > sysctl_sched_migration_cost. If the sleep ends up being a long one,
> > preventing other tasks from being migrated to this rq for a tiny
> > window should not matter performance-wise. I would expect that it
> > could help workloads that come in bursts.
>
> There is perhaps a better way to handle bursts:
>
> When calculating the sleep_avg, we actually only really care about
> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
> to select which of the sleeps should be taken into account in the avg.
>
> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
>
> u64 now = sched_clock_cpu(task_cpu(p));
>
> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
> now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> update_avg(&p->se.burst_sleep_avg, now - last_sleep);
>
> Then we can have this code is dequeue_task_fair:
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg)
> rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>
> Thoughts ?
>

This looks reasonable, it would be fine grained to only monitor the
short sleep duration of that task. Let me try this approach to see if
there is any difference.

thanks,
Chenyu

> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
On 9/12/23 07:53, Chen Yu wrote:
> Hi Mathieu,
>
> thanks for the review,
>
> On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
>> On 9/11/23 11:26, Mathieu Desnoyers wrote:
>>> On 9/10/23 22:50, Chen Yu wrote:
>> [...]
>>>> ---
>>>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
>>>> kernel/sched/features.h | 1 +
>>>> kernel/sched/sched.h | 1 +
>>>> 3 files changed, 29 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e20f50726ab8..fe3b760c9654 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
>>>> struct task_struct *p, int flags)
>>>> hrtick_update(rq);
>>>> now = sched_clock_cpu(cpu_of(rq));
>>>> p->se.prev_sleep_time = task_sleep ? now : 0;
>>>> +#ifdef CONFIG_SMP
>>>> + /*
>>>> + * If this rq will become idle, and dequeued task is
>>>> + * a short sleeping one, check if we can reserve
>>>> + * this idle CPU for that task for a short while.
>>>> + * During this reservation period, other wakees will
>>>> + * skip this 'idle' CPU in select_idle_cpu(), and this
>>>> + * short sleeping task can pick its previous CPU in
>>>> + * select_idle_sibling(), which brings better cache
>>>> + * locality.
>>>> + */
>>>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>>> + p->se.sleep_avg && p->se.sleep_avg <
>>>> sysctl_sched_migration_cost)
>>>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
>>>
>>> This is really cool!
>>>
>>> There is one scenario that worries me with this approach: workloads
>>> that sleep for a long time and then have short blocked periods.
>>> Those bursts will likely bring the average to values too high
>>> to stay below sysctl_sched_migration_cost.
>>>
>>> I wonder if changing the code above for the following would help ?
>>>
>>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> p->se.sleep_avg)
>>> rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
>>> p->se.sleep_avg);
>>>
>>> For tasks that have a large sleep_avg, it would activate this rq
>>> "appear as not idle for rq selection" scheme for a window of
>>> sysctl_sched_migration_cost. If the sleep ends up being a long one,
>>> preventing other tasks from being migrated to this rq for a tiny
>>> window should not matter performance-wise. I would expect that it
>>> could help workloads that come in bursts.
>>
>> There is perhaps a better way to handle bursts:
>>
>> When calculating the sleep_avg, we actually only really care about
>> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
>> to select which of the sleeps should be taken into account in the avg.
>>
>> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
>>
>> u64 now = sched_clock_cpu(task_cpu(p));
>>
>> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
>> now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
>> update_avg(&p->se.burst_sleep_avg, now - last_sleep);
>>
>> Then we can have this code in dequeue_task_fair:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
>> rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
>> Thoughts ?
>>
>
> This looks reasonable, it would be fine grained to only monitor the short sleep duration
> of that task. Let me try this approach to see if there is any difference.
>
One more tweak: given that more than one task can update the cache_hot_timeout forward
one after another, and given that some tasks have larger burst_sleep_avg values than
others, I suspect we want to keep the forward movement monotonic with something like:
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
Thanks,
Mathieu
> thanks,
> Chenyu
>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
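To see how such a burst-filtered average behaves compared to averaging every sleep, here is a small standalone sketch. It is not code from this series: the 1/8-weight update only mirrors the shape of the scheduler's update_avg() helper, and the 500 us threshold stands in for a typical sysctl_sched_migration_cost value; both are assumptions made for this illustration.

#include <stdio.h>
#include <stdint.h>

/* Illustrative threshold standing in for sysctl_sched_migration_cost (ns). */
#define MIGRATION_COST_NS 500000ULL

/* 1/8-weight moving average, shaped after the scheduler's update_avg(). */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

int main(void)
{
	/* A task that mostly sleeps long, with occasional short bursts (ns). */
	uint64_t sleeps[] = { 10000000, 12000000, 80000, 90000, 15000000, 70000 };
	uint64_t plain_avg = 0, burst_avg = 0;

	for (size_t i = 0; i < sizeof(sleeps) / sizeof(sleeps[0]); i++) {
		update_avg(&plain_avg, sleeps[i]);

		/* Only sleeps shorter than the threshold feed the burst average. */
		if (sleeps[i] < MIGRATION_COST_NS)
			update_avg(&burst_avg, sleeps[i]);

		printf("sleep=%8llu ns plain_avg=%8llu ns burst_avg=%6llu ns\n",
		       (unsigned long long)sleeps[i],
		       (unsigned long long)plain_avg,
		       (unsigned long long)burst_avg);
	}

	return 0;
}

The long sleeps drag the plain average far above the 500 us gate, so a check like "sleep_avg < sysctl_sched_migration_cost" would never reserve the CPU for such a task, while the burst-filtered value keeps tracking only the short blocked periods, which is what the proposed burst_sleep_avg is after.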
On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
> On 9/12/23 07:53, Chen Yu wrote:
[...]
> > This looks reasonable, it would be fine grained to only monitor the short sleep duration
> > of that task. Let me try this approach to see if there is any difference.
> >
> One more tweak: given that more than one task can update the cache_hot_timeout forward
> one after another, and given that some tasks have larger burst_sleep_avg values than
> others, I suspect we want to keep the forward movement monotonic with something like:
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
> rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
> rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>

Yeah, Aaron has mentioned this too:
https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/

May I know the benefit of keeping the forward movement monotonic?
My thought was that we should only honor the latest dequeued task's burst_sleep_avg.
Because we don't know whether the old dequeued task's cache has been scribbled by the latest
dequeued task or not, does it still make sense to wake up the old dequeued task on its
previous CPU?

thanks,
Chenyu
On 9/12/23 10:14, Chen Yu wrote:
> On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
[...]
>>
>> One more tweak: given that more than one task can update the cache_hot_timeout forward
>> one after another, and given that some tasks have larger burst_sleep_avg values than
>> others, I suspect we want to keep the forward movement monotonic with something like:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>> rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
>> rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
>
> Yeah, Aaron has mentioned this too:
> https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
>
> May I know the benefit of keeping the forward movement monotonic?
> My thought was that we should only honor the latest dequeued task's burst_sleep_avg.
> Because we don't know whether the old dequeued task's cache has been scribbled by the latest
> dequeued task or not, does it still make sense to wake up the old dequeued task on its
> previous CPU?

Here is my reasoning:

If a second task is scheduled after the first dequeued task (a task with large
burst_sleep_avg) is dequeued, that second task (with small burst_sleep_avg) would
need to entirely scribble the other task's cache lines within the time given by
sysctl_sched_migration_cost, which I suspect is typically not very large.

So I doubt that the second task can entirely kick out the first task's cache lines
within that time frame, and therefore that second task should not move the
cache_hot_timeout value backwards.

But perhaps I'm missing something ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
>
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
>
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
>
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
>
> netperf
> =======
> case load baseline(std%) compare%( std%)
> UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04)
> UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42)
> UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96)
> UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50)
>
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
>
> hackbench
> =========
> case load baseline(std%) compare%( std%)
> process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48)
> process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10)
> process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41)
> process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95)
> process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97)
> process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09)
> threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49)
> threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90)
> threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44)
> threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43)
> threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91)
> threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07)
>
> schbench(old)
> ========
> case load baseline(std%) compare%( std%)
> normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37)
> normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83)
> normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28)
> normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28)
>
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
>
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 1 +
> 3 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> now = sched_clock_cpu(cpu_of(rq));
> p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> + /*
> + * If this rq will become idle, and dequeued task is
> + * a short sleeping one, check if we can reserve
> + * this idle CPU for that task for a short while.
> + * During this reservation period, other wakees will
> + * skip this 'idle' CPU in select_idle_cpu(), and this
> + * short sleeping task can pick its previous CPU in
> + * select_idle_sibling(), which brings better cache
> + * locality.
> + */
> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> + rq->cache_hot_timeout = now + p->se.sleep_avg;
Should this be written as:
rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
?
An even earlier task may have a bigger sleep_avg, and overwriting
rq->cache_hot_timeout here with an earlier time may cause that task to
migrate. Not sure how much impact this can have, just something that
crossed my mind while reading this code.
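To make that concern concrete, here is a small standalone sketch with made-up timestamps (illustrative only, not code from the series), comparing the plain overwrite of cache_hot_timeout with a max()-style update:

#include <stdio.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	uint64_t overwrite = 0, keep_max = 0;
	uint64_t now, sleep_avg;

	/* t = 0: task A is dequeued, sleep_avg = 400 us. */
	now = 0;
	sleep_avg = 400000;
	overwrite = now + sleep_avg;
	keep_max = MAX(keep_max, now + sleep_avg);

	/* t = 100 us: task B is dequeued on the same rq, sleep_avg = 50 us. */
	now = 100000;
	sleep_avg = 50000;
	overwrite = now + sleep_avg;			/* reservation shrinks to t = 150 us */
	keep_max = MAX(keep_max, now + sleep_avg);	/* reservation stays at t = 400 us */

	/* t = 300 us: task A wakes up; is its previous CPU still reserved? */
	now = 300000;
	printf("overwrite: reserved=%d, max(): reserved=%d\n",
	       now < overwrite, now < keep_max);

	return 0;
}

With the plain overwrite, task A's CPU stops being reserved well before A's own average sleep has elapsed; the max() form keeps the reservation until the later of the two deadlines.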
> +#endif
> }
>
> #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> {
> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> - sched_cpu_cookie_match(cpu_rq(cpu), p))
> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> + if (sched_feat(SIS_CACHE) &&
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> + return -1;
> +
Maybe introduce a new function that also considers rq->cache_hot_timeout,
like available_idle_cpu_migrate() so that above and below logic can be
simplified a bit?
I was thinking to simply add that rq->cache_hot_timeout check to
available_idle_cpu() but then a long sleeping task could be forced to
migrate if its prev_cpu happens to just deschedule a task that sets rq's
cache_hot_timeout. I guess that's why you chose to only change the idle
semantic in select_idle_cpu() but not in select_idle_sibling()?
Thanks,
Aaron
> return cpu;
> + }
>
> return -1;
> }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> int cpu;
>
> for_each_cpu(cpu, cpu_smt_mask(core)) {
> - if (!available_idle_cpu(cpu)) {
> + bool cache_hot = sched_feat(SIS_CACHE) ?
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> + if (!available_idle_cpu(cpu) || cache_hot) {
> idle = false;
> if (*idle_cpu == -1) {
> - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + !cache_hot) {
> *idle_cpu = cpu;
> break;
> }
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
> #endif
> u64 idle_stamp;
> u64 avg_idle;
> + u64 cache_hot_timeout;
>
> unsigned long wake_stamp;
> u64 wake_avg_idle;
> --
> 2.25.1
>
Hi Aaron,
thanks for the review,
On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> >
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> >
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> >
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> >
> > netperf
> > =======
> > case load baseline(std%) compare%( std%)
> > UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04)
> > UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42)
> > UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96)
> > UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50)
> >
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> >
> > hackbench
> > =========
> > case load baseline(std%) compare%( std%)
> > process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48)
> > process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10)
> > process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41)
> > process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95)
> > process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97)
> > process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09)
> > threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49)
> > threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90)
> > threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44)
> > threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43)
> > threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91)
> > threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07)
> >
> > schbench(old)
> > ========
> > case load baseline(std%) compare%( std%)
> > normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37)
> > normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83)
> > normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28)
> > normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28)
> >
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> >
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > now = sched_clock_cpu(cpu_of(rq));
> > p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > + /*
> > + * If this rq will become idle, and dequeued task is
> > + * a short sleeping one, check if we can reserve
> > + * this idle CPU for that task for a short while.
> > + * During this reservation period, other wakees will
> > + * skip this 'idle' CPU in select_idle_cpu(), and this
> > + * short sleeping task can pick its previous CPU in
> > + * select_idle_sibling(), which brings better cache
> > + * locality.
> > + */
> > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > + rq->cache_hot_timeout = now + p->se.sleep_avg;
>
> Should this be written as:
> rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
> ?
>
> An even earlier task may have a bigger sleep_avg, and overwriting
> rq->cache_hot_timeout here with an earlier time may cause that task to
> migrate. Not sure how much impact this can have, just something that
> crossed my mind while reading this code.
>
My previous thought was that SIS_CACHE should only honor the average sleep time
of the latest dequeued task. Say, task p1 goes to sleep, then task p2 is woken
up and goes to sleep; we only honor the sleep time of p2. Because we don't know
how long p2 has been executing, and how much p2 has polluted the L1/L2 cache,
does it still make sense to let p1 be woken up on its previous CPU (is this CPU
still cache hot for p1)?
> > +#endif
> > }
> >
> > #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > {
> > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > - sched_cpu_cookie_match(cpu_rq(cpu), p))
> > + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > + if (sched_feat(SIS_CACHE) &&
> > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > + return -1;
> > +
>
> Maybe introduce a new function that also considers rq->cache_hot_timeout,
> like available_idle_cpu_migrate() so that above and below logic can be
> simplified a bit?
>
Yes, that would be simpler, I'll do that in the next version.
> I was thinking to simply add that rq->cache_hot_timeout check to
> available_idle_cpu() but then a long sleeping task could be forced to
> migrate if its prev_cpu happens to just deschedule a task that sets rq's
> cache_hot_timeout. I guess that's why you chose to only change the idle
> semantic in select_idle_cpu() but not in select_idle_sibling()?
>
Yes, sort of. The reason I did not put this cache-hot check in available_idle_cpu()
or idle_cpu() was mainly that these APIs are generic and can also be invoked by
select_idle_sibling(). If the task falls asleep and is woken up quickly, its previous
idle CPU would then be skipped as well, so no one could use this CPU within the
cache-hot period, including the cache-hot task itself.
thanks,
Chenyu
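For illustration, such a helper could take roughly the following shape. This is only a sketch built around the cache_hot_timeout field and the available_idle_cpu_migrate() name floated in this thread, not part of the posted patch:

/*
 * Sketch only: treat a CPU as a migration target if it is idle (or running
 * only SCHED_IDLE tasks) and is not inside a SIS_CACHE reservation window.
 * Keeping this separate from idle_cpu()/available_idle_cpu() leaves the
 * prev-CPU fast path in select_idle_sibling() untouched, so the short
 * sleeper itself can still pick its previous CPU.
 */
static inline bool available_idle_cpu_migrate(int cpu)
{
	if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
		return false;

	if (sched_feat(SIS_CACHE) &&
	    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
		return false;

	return true;
}

The open-coded checks in __select_idle_cpu() and select_idle_core() could then collapse into calls to one predicate like this, combined with sched_cpu_cookie_match() where needed.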