[PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Andrei Vagin 1 year, 8 months ago
From: Peter Oskolkov <posk@google.com>

Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.
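
A caller outside of sched/core.c (for example, a later patch in this
series) can then request a current-CPU wakeup along these lines (a
minimal sketch; the call site and the task pointer are illustrative):

	/* Advise the scheduler to run the wakee on this CPU. */
	try_to_wake_up(wakee, TASK_NORMAL, WF_CURRENT_CPU);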

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e038b48..386a0c40d341 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4123,8 +4123,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f855b96..4c67652aa302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f8420e9ed290 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2093,12 +2093,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
 
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3232,6 +3233,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
-- 
2.40.0.rc0.216.gc4246ad0f0-goog
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Chen Yu 1 year, 6 months ago
On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> From: Peter Oskolkov <posk@google.com>
> 
> Add a WF_CURRENT_CPU wake flag that advises the scheduler to
> move the wakee to the current CPU. This is useful for fast on-CPU
> context switching use cases.
> 
> In addition, make ttwu external rather than static so that
> the flag can be passed to it from outside of sched/core.c.
> 
> Signed-off-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Andrei Vagin <avagin@google.com>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	if (wake_flags & WF_TTWU) {
>  		record_wakee(p);
>  
> +		if ((wake_flags & WF_CURRENT_CPU) &&
> +		    cpumask_test_cpu(cpu, p->cpus_ptr))
> +			return cpu;
> +
I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, however
there are regressions when running some workloads, because these
workloads want to be spread on idle CPUs whenever possible.
The reason for the regression is that the above change chooses the
current CPU no matter what the load/utilization of this CPU is. So tasks
are stacked on one CPU, which hurts throughput/latency. And I believe
this issue would be more severe on systems with a smaller number of CPUs
within one LLC (compared to Intel platforms), such as AMD or Arm64.

I know WF_CURRENT_CPU benefits seccomp, but can we make this change more
generic to benefit other workloads, by making the condition that
triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
runnable task, and treat the current CPU as the last resort by checking
that the wakee's previous CPU is not idle. In this way, we can enable
the WF_CURRENT_CPU flag dynamically when some condition is met (a short
task, for example).
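
A rough sketch of what such a stricter check might look like in
select_task_rq_fair() (untested, just to illustrate the idea; it reuses
the existing cpu_rq() and idle_cpu() helpers):

	if ((wake_flags & WF_CURRENT_CPU) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr) &&
	    cpu_rq(cpu)->nr_running <= 1 &&	/* only the waker runs here */
	    !idle_cpu(prev_cpu))		/* current CPU is a last resort */
		return cpu;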

Thanks,
Chenyu
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Andrei Vagin 1 year, 6 months ago
On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > [...]
> I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, however
> there are regressions when running some workloads, because these
> workloads want to be spread on idle CPUs whenever possible.
> The reason for the regression is that the above change chooses the
> current CPU no matter what the load/utilization of this CPU is. So tasks
> are stacked on one CPU, which hurts throughput/latency. And I believe
> this issue would be more severe on systems with a smaller number of CPUs
> within one LLC (compared to Intel platforms), such as AMD or Arm64.

WF_CURRENT_CPU works only under certain conditions. Maybe you saw my
attempt to change how WF_SYNC works:

https://www.spinics.net/lists/kernel/msg4567650.html

We then found that this idea doesn't work well, and that is the reason
why we have the separate WF_CURRENT_CPU flag.

>
> I know WF_CURRENT_CPU benefits seccomp, but can we make this change more
> generic to benefit other workloads, by making the condition that
> triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
> runnable task, and treat the current CPU as the last resort by checking
> that the wakee's previous CPU is not idle. In this way, we can enable
> the WF_CURRENT_CPU flag dynamically when some condition is met (a short
> task, for example).

We discussed all of these here and here:

https://www.spinics.net/lists/kernel/msg4657545.html

https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/

I like your idea about short-duration tasks, but I think it is a
separate task and has to be done in a separate patch set. Here, I am
solving the problem of optimizing synchronous switches, when one task
wakes up another one and falls asleep immediately after that. Waking up
the target task on the current CPU looks reasonable in this case for a
few reasons. First, waking up a task on the current CPU is cheaper than
on another one, and it is much cheaper than waking it on an idle CPU.
Second, when tasks do synchronous switches, they often exchange some
data, so memory caches work in our favor.

Thanks,
Andrei
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Chen Yu 1 year, 6 months ago
On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@intel.com> wrote:
> >
> > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > [...]
> > I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, however
> > there are regressions when running some workloads, because these
> > workloads want to be spread on idle CPUs whenever possible.
> > The reason for the regression is that the above change chooses the
> > current CPU no matter what the load/utilization of this CPU is. So tasks
> > are stacked on one CPU, which hurts throughput/latency. And I believe
> > this issue would be more severe on systems with a smaller number of CPUs
> > within one LLC (compared to Intel platforms), such as AMD or Arm64.
> 
> WF_CURRENT_CPU works only under certain conditions. Maybe you saw my
> attempt to change how WF_SYNC works:
> 
> https://www.spinics.net/lists/kernel/msg4567650.html
> 
> We then found that this idea doesn't work well, and that is the reason
> why we have the separate WF_CURRENT_CPU flag.
>
I see; in the seccomp case, even an idle CPU is not a better choice.
> >
> > I know WF_CURRENT_CPU benefits seccomp, but can we make this change more
> > generic to benefit other workloads, by making the condition that
> > triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
> > runnable task, and treat the current CPU as the last resort by checking
> > that the wakee's previous CPU is not idle. In this way, we can enable
> > the WF_CURRENT_CPU flag dynamically when some condition is met (a short
> > task, for example).
> 
> We discussed all of these here and here:
> 
> https://www.spinics.net/lists/kernel/msg4657545.html
> 
> https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
> 
> I like your idea about short-duration tasks, but I think it is a
> separate task and has to be done in a separate patch set. Here, I am
> solving the problem of optimizing synchronous switches, when one task
> wakes up another one and falls asleep immediately after that. Waking up
> the target task on the current CPU looks reasonable in this case for a
> few reasons. First, waking up a task on the current CPU is cheaper than
> on another one, and it is much cheaper than waking it on an idle CPU.
It depends. For a waker and wakee that compete for cache resources and
do not share data, an idle target would sometimes be better.
> Second,
> when tasks do synchronous switches, they often exchange some
> data, so memory caches work in our favor.
I like the name 'WF_CURRENT_CPU' too : ) and I was thinking about
whether this could become an auto-detect behavior so others can benefit
from it.

If I understand correctly, the scenario this patch deals with is:
task A wakes up task B; task A and task B have a close relationship with
each other (cache sharing, e.g.); when task A falls asleep, choose A's
CPU rather than an idle CPU.

I'm thinking if the following logic would cover your case:
1. the waker A is a short duration one (waker will fall asleep soon)
2. the waker B is a short duration one (impact of B is minor)
3. the A->wakee_flips is 0 and A->last_wakee = B
4. the A->wakee_flips is 0 and B->last_wakee = A
5, cpu(A)->nr_running = 1

(3 and 4 mean that A and B wake up each other, so it is likely that
they share cache data, and they are good friends to be put together)

If the above conditions are met, choose the current CPU. In this way,
WF_CURRENT_CPU can be set dynamically.

thanks,
Chenyu
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Andrei Vagin 1 year, 6 months ago
On Mon, Apr 10, 2023 at 11:17 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@intel.com> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > [...]
> > > I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, however
> > > there are regressions when running some workloads, because these
> > > workloads want to be spread on idle CPUs whenever possible.
> > > The reason for the regression is that the above change chooses the
> > > current CPU no matter what the load/utilization of this CPU is. So tasks
> > > are stacked on one CPU, which hurts throughput/latency. And I believe
> > > this issue would be more severe on systems with a smaller number of CPUs
> > > within one LLC (compared to Intel platforms), such as AMD or Arm64.
> >
> > WF_CURRENT_CPU works only under certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > We then found that this idea doesn't work well, and that is the reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> I see; in the seccomp case, even an idle CPU is not a better choice.
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, but can we make this change
> > > more generic to benefit other workloads, by making the condition that
> > > triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has
> > > one runnable task, and treat the current CPU as the last resort by
> > > checking that the wakee's previous CPU is not idle. In this way, we can
> > > enable the WF_CURRENT_CPU flag dynamically when some condition is met
> > > (a short task, for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and has to be done in a separate patch set. Here, I am
> > solving the problem of optimizing synchronous switches, when one task
> > wakes up another one and falls asleep immediately after that. Waking up
> > the target task on the current CPU looks reasonable in this case for a
> > few reasons. First, waking up a task on the current CPU is cheaper than
> > on another one, and it is much cheaper than waking it on an idle CPU.
> It depends. For a waker and wakee that compete for cache resources and
> do not share data, an idle target would sometimes be better.
> > Second,
> > when tasks do synchronous switches, they often exchange some
> > data, so memory caches work in our favor.
> I like the name 'WF_CURRENT_CPU' too : ) and I was thinking about
> whether this could become an auto-detect behavior so others can benefit
> from it.
>
> If I understand correctly, the scenario this patch deals with is:
> task A wakes up task B; task A and task B have a close relationship with
> each other (cache sharing, e.g.); when task A falls asleep, choose A's
> CPU rather than an idle CPU.
>
> I'm thinking if the following logic would cover your case:
> 1. the waker A is a short duration one (waker will fall asleep soon)
> 2. the waker B is a short duration one (impact of B is minor)

In the seccomp case, A or B can be a non-short-duration task, but if
they do synchronous switches, they get all the benefits of
WF_CURRENT_CPU.

> 3. the A->wakee_flips is 0 and A->last_wakee = B

In our case, a "supervisor" is written in golang and there are
goroutines that can be
executed from different system threads, so this condition will fail often too.

> 4. the A->wakee_flips is 0 and B->last_wakee = A
> 5, cpu(A)->nr_running = 1
>
> (3 and 4 mean that A and B wake up each other, so it is likely that
> they share cache data, and they are good friends to be put together)
>
> If the above conditions are met, choose the current CPU. In this way,
> WF_CURRENT_CPU can be set dynamically.
>
> thanks,
> Chenyu
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Chen Yu 1 year, 6 months ago
On 2023-04-11 at 02:16:56 +0800, Chen Yu wrote:
> [...]
> I'm thinking if the following logic would cover your case:
> 1. the waker A is a short duration one (waker will fall asleep soon)
> 2. the waker B is a short duration one (impact of B is minor)
typo:  s/waker B/wakee B/
> 3. the A->wakee_flips is 0 and A->last_wakee = B
> 4. the A->wakee_flips is 0 and B->last_wakee = A
typo:  s/A->wakee_flips/B->wakee_flips/

Sorry I typed too quickly yesterday.
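
With those corrections applied, the combined check might look like this
(an untested sketch; is_short_task() is assumed from the short-duration
task patch set and is not an upstream helper):

	/*
	 * Hypothetical auto-detect check: the waker and wakee are
	 * short-duration tasks, they only ever wake each other, and
	 * only the waker is running on this CPU.
	 */
	static bool should_pick_current_cpu(struct task_struct *waker,
					    struct task_struct *wakee, int cpu)
	{
		return is_short_task(waker) && is_short_task(wakee) &&
		       !waker->wakee_flips && waker->last_wakee == wakee &&
		       !wakee->wakee_flips && wakee->last_wakee == waker &&
		       cpu_rq(cpu)->nr_running == 1;
	}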

thanks,
Chenyu
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Andy Lutomirski 1 year, 6 months ago
On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@intel.com> wrote:
> >
> > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > [...]
> > I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, however
> > there are regressions when running some workloads, because these
> > workloads want to be spread on idle CPUs whenever possible.
> > The reason for the regression is that the above change chooses the
> > current CPU no matter what the load/utilization of this CPU is. So tasks
> > are stacked on one CPU, which hurts throughput/latency. And I believe
> > this issue would be more severe on systems with a smaller number of CPUs
> > within one LLC (compared to Intel platforms), such as AMD or Arm64.
>
> WF_CURRENT_CPU works only under certain conditions. Maybe you saw my
> attempt to change how WF_SYNC works:
>
> https://www.spinics.net/lists/kernel/msg4567650.html
>
> We then found that this idea doesn't work well, and that is the reason
> why we have the separate WF_CURRENT_CPU flag.
>
> >
> > I know WF_CURRENT_CPU benefits seccomp, but can we make this change more
> > generic to benefit other workloads, by making the condition that
> > triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
> > runnable task, and treat the current CPU as the last resort by checking
> > that the wakee's previous CPU is not idle. In this way, we can enable
> > the WF_CURRENT_CPU flag dynamically when some condition is met (a short
> > task, for example).
>
> We discussed all of these here and here:
>
> https://www.spinics.net/lists/kernel/msg4657545.html
>
> https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
>
> I like your idea about short-duration tasks, but I think it is a
> separate task and has to be done in a separate patch set. Here, I am
> solving the problem of optimizing synchronous switches, when one task
> wakes up another one and falls asleep immediately after that. Waking up
> the target task on the current CPU looks reasonable in this case for a
> few reasons. First, waking up a task on the current CPU is cheaper than
> on another one, and it is much cheaper than waking it on an idle CPU.
> Second, when tasks do synchronous switches, they often exchange some
> data, so memory caches work in our favor.

I've contemplated this on occasion for quite a few years, and I think
that part of our issue is that the userspace ABI part doesn't exist
widely.  In particular, most of the common ways that user tasks talk
to each other don't have a single system call that can do the
send-a-message-and-start-waiting part all at once.  For example, if
task A is running and it wants to wake task B and then sleep:

UNIX sockets (or other sockets): A calls send() or write() or
sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll()
or select().

Pipes: Same as sockets except no send/recv.

Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :)

I think io_uring can kind of do a write-and-wait operation, but I
doubt it's wired up for this purpose.


seccomp seems like it should be able to do this straightforwardly on
the transition from the seccomp-sandboxed task to the monitor, but the
reverse direction is tricky.



Anyway, short of a massive API project, I don't see a totally
brilliant solution.  But maybe we could take a baby step toward a
general solution by deferring all the hard work of a wakeup a bit so
that, as we grow syscalls and other mechanisms that do wake-and-wait,
we can optimize them automatically.  For example, we could perhaps add
a pending wakeup to task_struct, kind of like:

struct task_struct *task_to_wake;

and then, the next time we sleep or return to usermode, we handle the
wakeup.  And if we're going to sleep, we can do it as an optimized
synchronous wakeup.  And if we try to wake a task while task_to_wake
is set, we just wake everything normally.
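
In rough, hypothetical C (the names are invented, and all lifetime and
locking details are omitted):

	/* In struct task_struct: a single deferred-wakeup slot. */
	struct task_struct *task_to_wake;

	/* Instead of waking @p immediately, park it in the slot. */
	static void wake_task_deferred(struct task_struct *p)
	{
		if (cmpxchg(&current->task_to_wake, NULL, p))
			wake_up_process(p);	/* slot busy: wake normally */
	}

	/* Called when we go to sleep or return to usermode. */
	static void flush_deferred_wakeup(void)
	{
		struct task_struct *p = xchg(&current->task_to_wake, NULL);

		if (p)
			try_to_wake_up(p, TASK_NORMAL, WF_CURRENT_CPU);
	}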

(There are refcounting issues here, and maybe this wants to be percpu,
not per task.)

I think it would be really nifty if Linux could somewhat reliably do
this style of synchronous context switch.

PeterZ, is this at all sensible or am I nuts?

--Andy
Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu
Posted by Andrei Vagin 1 year, 6 months ago
On Mon, Apr 10, 2023 at 10:27 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> [...]
>
> I've contemplated this on occasion for quite a few years, and I think
> that part of our issue is that the userspace ABI part doesn't exist
> widely.  In particular, most of the common ways that user tasks talk
> to each other don't have a single system call that can do the
> send-a-message-and-start-waiting part all at once.  For example, if
> task A is running and it wants to wake task B and then sleep:
>
> UNIX sockets (or other sockets): A calls send() or write() or
> sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll()
> or select().
>
> Pipes: Same as sockets except no send/recv.
>
> Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :)

Futexes? Here was an attempt to add FUTEX_SWAP a few years ago:
https://www.spinics.net/lists/kernel/msg3871065.html

It wasn't merged upstream, in favor of umcg:
https://lore.kernel.org/linux-mm/20211122211327.5931-1-posk@google.com/

Both of these features solve similar problems; FUTEX_SWAP is simple and
straightforward, while umcg is broader and more complicated.
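
For reference, the two-syscall pattern that FUTEX_SWAP wanted to
collapse into a single call looks roughly like this in userspace today
(a minimal sketch):

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Wake the peer, then go to sleep: two syscalls per switch. */
	static void futex_switch_to(unsigned int *wake_addr,
				    unsigned int *wait_addr, unsigned int val)
	{
		syscall(SYS_futex, wake_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
		/* Sleeps if *wait_addr still contains val. */
		syscall(SYS_futex, wait_addr, FUTEX_WAIT, val, NULL, NULL, 0);
	}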

>
> I think io_uring can kind of do a write-and-wait operation, but I
> doubt it's wired up for this purpose.

I think it may be a good candidate for where this logic can be placed.

>
>
> seccomp seems like it should be able to do this straightforwardly on
> the transition from the seccomp-sandboxed task to the monitor, but the
> reverse direction is tricky.
>
>
>
> Anyway, short of a massive API project, I don't see a totally
> brilliant solution.  But maybe we could take a baby step toward a
> general solution by deferring all the hard work of a wakeup a bit so
> that, as we grow syscalls and other mechanisms that do wake-and-wait,
> we can optimize them automatically.  For example, we could perhaps add
> a pending wakeup to task_struct, kind of like:
>
> struct task_struct *task_to_wake;
>
> and then, the next time we sleep or return to usermode, we handle the
> wakeup.  And if we're going to sleep, we can do it as an optimized
> synchronous wakeup.  And if we try to wake a task while task_to_wake
> is set, we just wake everything normally.

I am not sure that I understand when it has to be set and when it takes
effect. For example, suppose we want to do a write&read syscall pair:
write sets task_to_wake, then the current task resumes without waking
the target task, and only after that does task_to_wake take effect. In
other words, it has to take effect not on the next return to user mode,
but on the one after that.

Thanks,
Andrei
