From: Peter Oskolkov <posk@google.com>
Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 13 ++++++++-----
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e038b48..386a0c40d341 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4123,8 +4123,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *         %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
         unsigned long flags;
         int cpu, success = 0;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f855b96..4c67652aa302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
         if (wake_flags & WF_TTWU) {
                 record_wakee(p);

+                if ((wake_flags & WF_CURRENT_CPU) &&
+                    cpumask_test_cpu(cpu, p->cpus_ptr))
+                        return cpu;
+
                 if (sched_energy_enabled()) {
                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
                         if (new_cpu >= 0)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f8420e9ed290 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2093,12 +2093,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }

 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC        0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK        0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU        0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */

-#define WF_SYNC        0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED    0x20 /* Internal use, task got migrated */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU  0x40 /* Prefer to move the wakee to the current CPU. */

 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3232,6 +3233,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--
2.40.0.rc0.216.gc4246ad0f0-goog
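
[For illustration: a minimal sketch of how a caller outside sched/core.c
might use the newly exported try_to_wake_up() with WF_CURRENT_CPU. The
patch itself adds no such caller; wake_on_current_cpu() and its context
are hypothetical, while try_to_wake_up(), TASK_NORMAL, and
WF_CURRENT_CPU are taken from the patch above.]

#include "sched.h"  /* kernel/sched/sched.h: WF_CURRENT_CPU, try_to_wake_up() */

/*
 * Hypothetical wake path for a synchronous task switch: wake @target
 * and hint that it should run right here, on the waker's CPU.
 */
static void wake_on_current_cpu(struct task_struct *target)
{
        /*
         * WF_CURRENT_CPU is only a hint: select_task_rq_fair() honors
         * it only when the current CPU is in the wakee's affinity mask
         * (the cpumask_test_cpu(cpu, p->cpus_ptr) check in the hunk
         * above).
         */
        try_to_wake_up(target, TASK_NORMAL, WF_CURRENT_CPU);
}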
On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> From: Peter Oskolkov <posk@google.com>
>
> Add a WF_CURRENT_CPU wake flag that advises the scheduler to
> move the wakee to the current CPU. This is useful for fast on-CPU
> context switching use cases.
>
> In addition, make ttwu external rather than static so that
> the flag can be passed to it from outside of sched/core.c.
>
> Signed-off-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Andrei Vagin <avagin@google.com>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>          if (wake_flags & WF_TTWU) {
>                  record_wakee(p);
>
> +                if ((wake_flags & WF_CURRENT_CPU) &&
> +                    cpumask_test_cpu(cpu, p->cpus_ptr))
> +                        return cpu;
> +

I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, but there
are regressions when running some workloads; these workloads want to be
spread across idle CPUs whenever possible.

The reason for the regression is that the change above chooses the
current CPU no matter what the load/utilization of that CPU is, so tasks
get stacked on one CPU and throughput/latency suffers. I believe this
issue would be more severe on systems with a smaller number of CPUs per
LLC (compared to Intel platforms), such as AMD and arm64.

I know WF_CURRENT_CPU benefits seccomp. Can we make this change more
generic, so that other workloads benefit, by making the condition that
triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
runnable task, and treat the current CPU as a last resort by checking
that the wakee's previous CPU is not idle. In this way, the
WF_CURRENT_CPU flag can be enabled dynamically when some condition is
met (a short task, for example).

Thanks,
Chenyu
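
[A sketch of the stricter gate suggested above, written against the
select_task_rq_fair() hunk from the patch. This is one reading of the
suggestion, not posted code; cpu_rq(), idle_cpu(), and the cpu/prev_cpu
variables all exist at that point in select_task_rq_fair().]

        if (wake_flags & WF_TTWU) {
                record_wakee(p);

                /*
                 * Stricter WF_CURRENT_CPU fast path: take the current
                 * CPU only when nothing but the waker runs on it, and
                 * only as a last resort, i.e. when the wakee's
                 * previous CPU is not idle anyway.
                 */
                if ((wake_flags & WF_CURRENT_CPU) &&
                    cpumask_test_cpu(cpu, p->cpus_ptr) &&
                    cpu_rq(cpu)->nr_running <= 1 &&
                    !idle_cpu(prev_cpu))
                        return cpu;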
On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <yu.c.chen@intel.com> wrote:
> I tried to reuse WF_CURRENT_CPU to mitigate cross-CPU wakeups, but there
> are regressions when running some workloads; these workloads want to be
> spread across idle CPUs whenever possible.
>
> The reason for the regression is that the change above chooses the
> current CPU no matter what the load/utilization of that CPU is, so tasks
> get stacked on one CPU and throughput/latency suffers. I believe this
> issue would be more severe on systems with a smaller number of CPUs per
> LLC (compared to Intel platforms), such as AMD and arm64.

WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
attempt to change how WF_SYNC works:

https://www.spinics.net/lists/kernel/msg4567650.html

Then we found that this idea doesn't work well, and that is the reason
why we have the separate WF_CURRENT_CPU flag.

> I know WF_CURRENT_CPU benefits seccomp. Can we make this change more
> generic, so that other workloads benefit, by making the condition that
> triggers WF_CURRENT_CPU stricter? Say, only when the current CPU has one
> runnable task, and treat the current CPU as a last resort by checking
> that the wakee's previous CPU is not idle. In this way, the
> WF_CURRENT_CPU flag can be enabled dynamically when some condition is
> met (a short task, for example).

We discussed all of this here and here:

https://www.spinics.net/lists/kernel/msg4657545.html
https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/

I like your idea about short-duration tasks, but I think it is a
separate task and has to be done in a separate patch set. Here, I am
solving the problem of optimizing synchronous switches, where one task
wakes up another and falls asleep immediately after that. Waking up the
target task on the current CPU looks reasonable in this case for a few
reasons. First, waking up a task on the current CPU is cheaper than
waking it on another one, and much cheaper than waking it on an idle
CPU. Second, when tasks do synchronous switches, they often exchange
data, so memory caches work in our favor.

Thanks,
Andrei
On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> attempt to change how WF_SYNC works:
>
> https://www.spinics.net/lists/kernel/msg4567650.html
>
> Then we found that this idea doesn't work well, and that is the reason
> why we have the separate WF_CURRENT_CPU flag.

I see; in the seccomp case, even an idle CPU is not a better choice.

[...]
> I like your idea about short-duration tasks, but I think it is a
> separate task and has to be done in a separate patch set. Here, I am
> solving the problem of optimizing synchronous switches, where one task
> wakes up another and falls asleep immediately after that. Waking up the
> target task on the current CPU looks reasonable in this case for a few
> reasons. First, waking up a task on the current CPU is cheaper than
> waking it on another one, and much cheaper than waking it on an idle
> CPU.

It depends. For a waker and wakee that compete for cache resources and
do not share data, an idle target is sometimes better.

> Second, when tasks do synchronous switches, they often exchange
> data, so memory caches work in our favor.

I like the name 'WF_CURRENT_CPU' too :) and I was thinking about whether
this could become an auto-detected behavior, so that others can benefit
from it.

If I understand correctly, the scenario this patch deals with is: task A
wakes up task B; A and B have a close relationship with each other
(cache sharing, e.g.); when task A falls asleep, choose A's CPU rather
than an idle CPU.

I'm thinking about whether the following logic would cover your case:
1. the waker A is a short-duration one (the waker will fall asleep soon)
2. the waker B is a short-duration one (the impact of B is minor)
3. A->wakee_flips is 0 and A->last_wakee = B
4. A->wakee_flips is 0 and B->last_wakee = A
5. cpu(A)->nr_running = 1

(3 and 4 mean that A and B wake up each other, so it is likely that they
share cached data and are good friends to be put together.)

If the above conditions are met, choose the current CPU. In this way,
WF_CURRENT_CPU can be set dynamically.

thanks,
Chenyu
On Mon, Apr 10, 2023 at 11:17 AM Chen Yu <yu.c.chen@intel.com> wrote:
[...]
> If I understand correctly, the scenario this patch deals with is: task A
> wakes up task B; A and B have a close relationship with each other
> (cache sharing, e.g.); when task A falls asleep, choose A's CPU rather
> than an idle CPU.
>
> I'm thinking about whether the following logic would cover your case:
> 1. the waker A is a short-duration one (the waker will fall asleep soon)
> 2. the waker B is a short-duration one (the impact of B is minor)

In the seccomp case, A or B can be a non-short-duration task, but if
they do synchronous switches, they get all the benefits of
WF_CURRENT_CPU.

> 3. A->wakee_flips is 0 and A->last_wakee = B

In our case, a "supervisor" is written in Go, and there are goroutines
that can be executed from different system threads, so this condition
will often fail too.

> 4. A->wakee_flips is 0 and B->last_wakee = A
> 5. cpu(A)->nr_running = 1
>
> (3 and 4 mean that A and B wake up each other, so it is likely that they
> share cached data and are good friends to be put together.)
>
> If the above conditions are met, choose the current CPU. In this way,
> WF_CURRENT_CPU can be set dynamically.
On 2023-04-11 at 02:16:56 +0800, Chen Yu wrote:
[...]
> I'm thinking about whether the following logic would cover your case:
> 1. the waker A is a short-duration one (the waker will fall asleep soon)
> 2. the waker B is a short-duration one (the impact of B is minor)

typo: s/waker B/wakee B/

> 3. A->wakee_flips is 0 and A->last_wakee = B
> 4. A->wakee_flips is 0 and B->last_wakee = A

typo: s/A->wakee_flips/B->wakee_flips/ (in condition 4)

Sorry, I typed too quickly yesterday.

thanks,
Chenyu
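
[Putting the corrected conditions together: a sketch of what the
auto-detection could look like. None of this was posted as code;
is_short_task() is assumed from Chen Yu's short-duration-task series,
while wakee_flips, last_wakee, and cpu_rq() exist in kernel/sched.]

/* True if waking @wakee on @waker's CPU looks safe per conditions 1-5. */
static bool can_wake_on_current_cpu(struct task_struct *waker,
                                    struct task_struct *wakee, int cpu)
{
        return is_short_task(waker) &&          /* 1: waker sleeps soon */
               is_short_task(wakee) &&          /* 2: wakee's impact is minor */
               waker->wakee_flips == 0 &&       /* 3: A only ever wakes B */
               waker->last_wakee == wakee &&
               wakee->wakee_flips == 0 &&       /* 4: B only ever wakes A */
               wakee->last_wakee == waker &&
               cpu_rq(cpu)->nr_running == 1;    /* 5: only the waker runs here */
}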
On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <avagin@gmail.com> wrote:
[...]
> I like your idea about short-duration tasks, but I think it is a
> separate task and has to be done in a separate patch set. Here, I am
> solving the problem of optimizing synchronous switches, where one task
> wakes up another and falls asleep immediately after that. Waking up the
> target task on the current CPU looks reasonable in this case for a few
> reasons. First, waking up a task on the current CPU is cheaper than
> waking it on another one, and much cheaper than waking it on an idle
> CPU. Second, when tasks do synchronous switches, they often exchange
> data, so memory caches work in our favor.

I've contemplated this on occasion for quite a few years, and I think
that part of our issue is that the userspace ABI part doesn't exist
widely. In particular, most of the common ways that user tasks talk to
each other don't have a single system call that can do the
send-a-message-and-start-waiting part all at once. For example, if task
A is running and it wants to wake task B and then sleep:

UNIX sockets (or other sockets): A calls send() or write() or
sendmsg(), then recv() or read() or recvmsg() or epoll_wait() or poll()
or select().

Pipes: same as sockets, except no send/recv.

Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't
count :)

I think io_uring can kind of do a write-and-wait operation, but I doubt
it's wired up for this purpose.

seccomp seems like it should be able to do this straightforwardly on
the transition from the seccomp-sandboxed task to the monitor, but the
reverse direction is tricky.

Anyway, short of a massive API project, I don't see a totally brilliant
solution. But maybe we could take a baby step toward a general solution
by deferring all the hard work of a wakeup a bit, so that, as we grow
syscalls and other mechanisms that do wake-and-wait, we can optimize
them automatically. For example, we could perhaps add a pending wakeup
to task_struct, kind of like:

struct task_struct *task_to_wake;

and then, the next time we sleep or return to usermode, we handle the
wakeup. And if we're going to sleep, we can do it as an optimized
synchronous wakeup. And if we try to wake a task while task_to_wake is
set, we just wake everything normally.

(There are refcounting issues here, and maybe this wants to be per-CPU,
not per-task.)

I think it would be really nifty if Linux could somewhat reliably do
this style of synchronous context switch.

PeterZ, is this at all sensible or am I nuts?

--Andy
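
[One possible shape of the deferred-wakeup idea, as a sketch only: the
task_to_wake field and both helpers are hypothetical, and the
refcounting caveat above applies. Assume task_struct gains
"struct task_struct *task_to_wake;".]

/* Wake-and-wait paths would call this instead of an immediate wakeup. */
static void defer_wakeup(struct task_struct *target)
{
        if (current->task_to_wake) {
                /* Second wakeup before sleeping: stop deferring, wake both. */
                wake_up_process(current->task_to_wake);
                put_task_struct(current->task_to_wake);
                current->task_to_wake = NULL;
                wake_up_process(target);
                return;
        }
        current->task_to_wake = get_task_struct(target);
}

/* Run on entry to schedule() and on return to usermode. */
static void flush_deferred_wakeup(bool going_to_sleep)
{
        struct task_struct *t = current->task_to_wake;

        if (!t)
                return;
        current->task_to_wake = NULL;
        /* A waker that is about to sleep allows an optimized sync wakeup. */
        try_to_wake_up(t, TASK_NORMAL,
                       going_to_sleep ? (WF_SYNC | WF_CURRENT_CPU) : 0);
        put_task_struct(t);
}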
On Mon, Apr 10, 2023 at 10:27 AM Andy Lutomirski <luto@kernel.org> wrote:
[...]
> In particular, most of the common ways that user tasks talk to
> each other don't have a single system call that can do the
> send-a-message-and-start-waiting part all at once. For example, if task
> A is running and it wants to wake task B and then sleep:
>
> UNIX sockets (or other sockets): A calls send() or write() or
> sendmsg(), then recv() or read() or recvmsg() or epoll_wait() or poll()
> or select().
>
> Pipes: same as sockets, except no send/recv.
>
> Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't
> count :)

futexes? Here was an attempt to add FUTEX_SWAP a few years ago:

https://www.spinics.net/lists/kernel/msg3871065.html

It wasn't merged upstream, in favor of umcg:

https://lore.kernel.org/linux-mm/20211122211327.5931-1-posk@google.com/

Both of these features solve similar problems: FUTEX_SWAP is simple and
straightforward, while umcg is broader and more complicated.

> I think io_uring can kind of do a write-and-wait operation, but I doubt
> it's wired up for this purpose.

I think it may be a good candidate for where this logic can be placed.

> Anyway, short of a massive API project, I don't see a totally brilliant
> solution. But maybe we could take a baby step toward a general solution
> by deferring all the hard work of a wakeup a bit, so that, as we grow
> syscalls and other mechanisms that do wake-and-wait, we can optimize
> them automatically. For example, we could perhaps add a pending wakeup
> to task_struct, kind of like:
>
> struct task_struct *task_to_wake;
>
> and then, the next time we sleep or return to usermode, we handle the
> wakeup. And if we're going to sleep, we can do it as an optimized
> synchronous wakeup. And if we try to wake a task while task_to_wake is
> set, we just wake everything normally.

I am not sure I understand when it has to be set and when it takes
effect. For example, suppose we want to do a paired write & read: write
sets task_to_wake, then the current task resumes without waking the
target task, and only after that does task_to_wake take effect. In
other words, it takes effect only after the next-but-one return to
user mode.

Thanks,
Andrei
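
[For reference: the userspace shape of the FUTEX_SWAP proposal linked
above, sketched from the RFC. FUTEX_SWAP never landed upstream, so the
opcode value and argument layout here are assumptions for illustration
only.]

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_SWAP
#define FUTEX_SWAP 13   /* assumed opcode; not in any released kernel */
#endif

/*
 * Go to sleep on @to_wait (if it still holds @val) and wake one waiter
 * on @to_wake in a single syscall - the one-shot
 * "send-a-message-and-start-waiting" entry point discussed above.
 */
static long futex_swap(unsigned int *to_wait, unsigned int val,
                       unsigned int *to_wake)
{
        return syscall(SYS_futex, to_wait, FUTEX_SWAP, val,
                       (void *)0 /* no timeout */, to_wake, 0);
}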