When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:
psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
When investigating the series of events leading up to the splat,
the following sequence was observed:
[008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
...
[008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
[008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
# CPU8 goes into newidle balance and releases the rq lock
...
# CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
[015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
[015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
[008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():
__schedule()
  rq_lock(rq)
  try_to_block_task(p)
    psi_dequeue()
    [ psi_task_switch() is responsible
      for adjusting the PSI flags ]
  put_prev_entity(&p->se)             try_to_wake_up(p)
  # no runnable task on rq->cfs         ...
  sched_balance_newidle()
    raw_spin_rq_unlock(rq)              __task_rq_lock(p)
    ...                                 psi_enqueue()/psi_ttwu_dequeue() [Woops!]
                                        __task_rq_unlock(p)
    raw_spin_rq_lock(rq)
    ...
  [ p was re-enqueued or has migrated away ]
  ...
  psi_task_switch() [Too late!]
  raw_spin_rq_unlock(rq)
The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked. Similarly, a migration
context in the wakeup path can clear the flags that psi_sched_switch()
assumes will be set (TSK_ONCPU / TSK_RUNNING).
Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
prevent the race. Specifically:
o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
psi_enqueue() will only be called with TSK_ONCPU set when the task is
being requeued on the same CPU. If the task was migrated,
psi_ttwu_dequeue() would have already cleared the PSI flags.
psi_enqueue() cannot guarantee that this same task will be picked
again when the scheduling CPU returns from newidle balance which is
why it clears the TSK_ONCPU to mimic a net result of sleep + wakeup
without migration.
o When psi_sched_switch() observes that prev's task_cpu() has changed or
the TSK_ONCPU flag is not set, a wakeup has raced with
psi_sched_switch()'s attempt to adjust the dequeue flags. If next is
the same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
already adjusted the PSI flags and no further changes are required
to prev's PSI flags.
With the introduction of DELAY_DEQUEUE, the requeue path is considerably
shortened, and with the addition of bandwidth throttling in the
__schedule() path, the race window is large enough to observe this
issue.
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
Reproducer for the PSI splat:
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
# Ridiculous limit on SMP to throttle multiple rqs at once
echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
perf bench sched messaging -t -p -l 100000 -g 16
This reproducer worked reliably on my 3rd Generation EPYC system
(2 x 64C/128T) and also on a 32 vCPU VM.
---
kernel/sched/core.c | 7 ++++-
kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/stats.h | 16 ++++++++++-
3 files changed, 83 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
rq->last_seen_need_resched_ns = 0;
#endif
+ /*
+ * PSI might have to deal with the consequences of newidle balance
+ * possibly dropping the rq lock and prev being requeued and selected.
+ */
+ psi_sched_switch(prev, next, block);
+
if (likely(prev != next)) {
rq->nr_switches++;
/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep)
{
struct psi_group *group, *common = NULL;
- int cpu = task_cpu(prev);
+ int prev_cpu, cpu;
+
+ /* No race between psi_dequeue() and now */
+ if (prev == next && (prev->psi_flags & TSK_ONCPU))
+ return;
+
+ prev_cpu = task_cpu(prev);
+ cpu = smp_processor_id();
if (next->pid) {
+ /*
+ * If next == prev but TSK_ONCPU is cleared, the task was
+ * requeued when newidle balance dropped the rq lock and
+ * psi_enqueue() cleared the TSK_ONCPU flag.
+ */
psi_flags_change(next, 0, TSK_ONCPU);
/*
* Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
*/
group = task_psi_group(next);
do {
- if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
- PSI_ONCPU) {
+ /*
+ * Since newidle balance can drop the rq lock (see the next comment)
+ * there is a possibility of try_to_wake_up() migrating prev away
+ * before reaching here. Do not find common if task has migrated.
+ */
+ if (prev_cpu == cpu &&
+ (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
common = group;
break;
}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
} while ((group = group->parent));
}
+ /*
+ * When a task is blocked, psi_dequeue() leaves the PSI flag
+ * adjustments to psi_task_switch() however, there is a possibility of
+ * rq lock being dropped in the interim and the task being woken up
+ * again before psi_task_switch() is called leading to psi_enqueue()
+ * seeing the flags for a running task. Specifically, the following
+ * scenario is possible:
+ *
+ * __schedule()
+ * rq_lock(rq)
+ * try_to_block_task(p)
+ * psi_dequeue()
+ * [ psi_task_switch() is responsible
+ * for adjusting the PSI flags ]
+ * put_prev_entity(&p->se) try_to_wake_up(p)
+ * # no runnable task on rq->cfs ...
+ * sched_balance_newidle()
+ * raw_spin_rq_unlock(rq) __task_rq_lock(p)
+ * ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+ * __task_rq_unlock(p)
+ * raw_spin_rq_lock(rq)
+ * ...
+ * [ p was re-enqueued or has migrated away ]
+ * ...
+ * psi_task_switch() [Too late!]
+ * raw_spin_rq_unlock(rq)
+ *
+ * In the above case, psi_enqueue() can see the p->psi_flags state
+ * before it is adjusted to account for dequeue in psi_task_switch(),
+ * or psi_ttwu_dequeue() can clear the p->psi_flags which
+ * psi_task_switch() tries to adjust assuming that the entity has just
+ * finished running.
+ *
+ * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
+ * the combination of TSK_ONCPU and task_cpu(p) to catch the race
+ * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue()
+ * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
+ * flags already for prev on this CPU, skip adjusting flags.
+ */
+ if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+ return;
+
if (prev->pid) {
int clear = TSK_ONCPU, set = 0;
bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
if (flags & ENQUEUE_RESTORE)
return;
- if (p->se.sched_delayed) {
+ if (p->psi_flags & TSK_ONCPU) {
+ /*
+ * psi_enqueue() can race with psi_task_switch() where
+ * TSK_ONCPU will be still set for the task (see the
+ * comment in psi_task_switch())
+ *
+ * Reaching here with TSK_ONCPU is only possible when
+ * the task is being enqueued on the same CPU. Since
+ * psi_task_switch() has not had the chance to adjust
+ * the flags yet, just clear the TSK_ONCPU which yields
+ * the same result as sleep + wakeup without migration.
+ */
+ SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+ clear = TSK_ONCPU;
+ } else if (p->se.sched_delayed) {
/* CPU migration of "sleeping" task */
SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
if (p->in_memstall)
base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
--
2.34.1
Hi,
On 2024/12/26 13:34, K Prateek Nayak wrote:
> When running hackbench in a cgroup with bandwidth throttling enabled,
> following PSI splat was observed:
>
> psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>
> When investigating the series of events leading up to the splat,
> following sequence was observed:
> [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
> ...
> [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
> # CPU8 goes into newidle balance and releases the rq lock
> ...
> # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
> [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
I have a question here: why is TSK_ONCPU not set in psi_flags if
the task hasn't arrived at psi_sched_switch() yet?
> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>
> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
> for the blocked entity, however, the following race is possible with
> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
> psi_sched_switch()
Yeah, this race is introduced by the delayed dequeue changes.
In the past, a sleeping task couldn't be migrated or enqueued before it was
done in __schedule(). (finish_task(prev) clears prev->on_cpu.)
Now, ttwu_runnable() can call enqueue_task() on the delayed-dequeue task
to make it schedulable again.
But migration is still impossible, since it's still running on this CPU,
so no psi_ttwu_dequeue(); only psi_enqueue() can happen, right?
(Actually, we could enqueue_task() there for any sleeping task, including
those that are not delayed dequeue, if select_task_rq() returns the same
CPU as task_cpu(p), to optimize wakeup latency; maybe I need to submit a
patch later.)
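To make that concrete, the wakeup path in question looks roughly like this
on a tree with DELAY_DEQUEUE (simplified sketch, not a verbatim copy of the
sources; details may differ slightly from the exact tree):

/*
 * Simplified sketch of ttwu_runnable(): a delayed-dequeue task is still
 * task_on_rq_queued(), so the wakeup re-enqueues it under the task's rq
 * lock and never goes through the migration / psi_ttwu_dequeue() path.
 */
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
	struct rq_flags rf;
	struct rq *rq;
	int ret = 0;

	rq = __task_rq_lock(p, &rf);
	if (task_on_rq_queued(p)) {
		update_rq_clock(rq);
		if (p->se.sched_delayed)
			/* psi_enqueue() runs from here, possibly while prev
			 * is still in the middle of __schedule() */
			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
		if (!task_on_cpu(rq, p))
			wakeup_preempt(rq, p, wake_flags);
		ttwu_do_wakeup(p);
		ret = 1;
	}
	__task_rq_unlock(rq, &rf);

	return ret;
}

Note that the enqueue above runs with the task's rq lock held (which CPU8
dropped for newidle balance) while p->on_cpu is still true.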
>
> __schedule()
> rq_lock(rq)
> try_to_block_task(p)
> psi_dequeue()
> [ psi_task_switch() is responsible
> for adjusting the PSI flags ]
> put_prev_entity(&p->se) try_to_wake_up(p)
> # no runnable task on rq->cfs ...
> sched_balance_newidle()
> raw_spin_rq_unlock(rq) __task_rq_lock(p)
> ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> __task_rq_unlock(p)
> raw_spin_rq_lock(rq)
> ...
> [ p was re-enqueued or has migrated away ]
Here ttwu_runnable() calls enqueue_task() for the delayed-dequeue task.
Migration can't happen since p->on_cpu is still true.
> ...
> psi_task_switch() [Too late!]
> raw_spin_rq_unlock(rq)
>
> The wakeup context will see the flags for a running task when the flags
> should have reflected the task being blocked. Similarly, a migration
> context in the wakeup path can clear the flags that psi_sched_switch()
> assumes will be set (TSK_ONCPU / TSK_RUNNING)
In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
should do nothing at all.
Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from the
PSI POV this task hasn't gone to sleep at all, and psi_enqueue() should NOT
change any state either. (It's not a wakeup or migration from the PSI POV.)
And the current code of "psi_sched_switch(prev, next, block);" looks
buggy to me too! The "block" value comes from try_to_block_task(), and
pick_next_task() may drop and re-acquire the rq lock, so we can't use the
stale value for psi_sched_switch().
Before, we used "task_on_rq_queued(prev)"; now we also have to consider
the delayed dequeue case, so it should be:
"!task_on_rq_queued(prev) || prev->se.sched_delayed"
Thanks!
>
> Since the TSK_ONCPU flag has to be modified with the rq lock of
> task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
> prevent the race. Specifically:
>
> o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
> psi_enqueue() will only be called with TSK_ONCPU set when the task is
> being requeued on the same CPU. If the task was migrated,
> psi_ttwu_dequeue() would have already cleared the PSI flags.
>
> psi_enqueue() cannot guarantee that this same task will be picked
> again when the scheduling CPU returns from newidle balance which is
> why it clears the TSK_ONCPU to mimic a net result of sleep + wakeup
> without migration.
>
> o When psi_sched_switch() observes that prev's task_cpu() has changes or
> the TSK_ONCPU flag is not set, a wakeup has raced with the
> psi_sched_switch() trying to adjust the dequeue flag. If the next is
> same as the prev, psi_sched_switch() has to now set the TSK_ONCPU flag
> again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
> already adjusted the PSI flags and no further changes are required
> to prev's PSI flags.
>
> With the introduction of DELAY_DEQUEUE, the requeue path is considerably
> shortened and with the addition of bandwidth throttling in the
> __schedule() path, the race window is large enough to observed this
> issue.
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> This patch is based on tip:sched/core at commit af98d8a36a96
> ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
>
> Reproducer for the PSI splat:
>
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> # Ridiculous limit on SMP to throttle multiple rqs at once
> echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
> perf bench sched messaging -t -p -l 100000 -g 16
>
> This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) but
> also on a 32 vCPU VM.
> ---
> kernel/sched/core.c | 7 ++++-
> kernel/sched/psi.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/stats.h | 16 ++++++++++-
> 3 files changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..9bbe51e44e98 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
> rq->last_seen_need_resched_ns = 0;
> #endif
>
> + /*
> + * PSI might have to deal with the consequences of newidle balance
> + * possibly dropping the rq lock and prev being requeued and selected.
> + */
> + psi_sched_switch(prev, next, block);
> +
> if (likely(prev != next)) {
> rq->nr_switches++;
> /*
> @@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
>
> migrate_disable_switch(rq, prev);
> psi_account_irqtime(rq, prev, next);
> - psi_sched_switch(prev, next, block);
>
> trace_sched_switch(preempt, prev, next, prev_state);
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 84dad1511d1e..c355a6189595 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> bool sleep)
> {
> struct psi_group *group, *common = NULL;
> - int cpu = task_cpu(prev);
> + int prev_cpu, cpu;
> +
> + /* No race between psi_dequeue() and now */
> + if (prev == next && (prev->psi_flags & TSK_ONCPU))
> + return;
> +
> + prev_cpu = task_cpu(prev);
> + cpu = smp_processor_id();
>
> if (next->pid) {
> + /*
> + * If next == prev but TSK_ONCPU is cleared, the task was
> + * requeued when newidle balance dropped the rq lock and
> + * psi_enqueue() cleared the TSK_ONCPU flag.
> + */
> psi_flags_change(next, 0, TSK_ONCPU);
> /*
> * Set TSK_ONCPU on @next's cgroups. If @next shares any
> @@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> */
> group = task_psi_group(next);
> do {
> - if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
> - PSI_ONCPU) {
> + /*
> + * Since newidle balance can drop the rq lock (see the next comment)
> + * there is a possibility of try_to_wake_up() migrating prev away
> + * before reaching here. Do not find common if task has migrated.
> + */
> + if (prev_cpu == cpu &&
> + (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
> common = group;
> break;
> }
> @@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> } while ((group = group->parent));
> }
>
> + /*
> + * When a task is blocked, psi_dequeue() leaves the PSI flag
> + * adjustments to psi_task_switch() however, there is a possibility of
> + * rq lock being dropped in the interim and the task being woken up
> + * again before psi_task_switch() is called leading to psi_enqueue()
> + * seeing the flags for a running task. Specifically, the following
> + * scenario is possible:
> + *
> + * __schedule()
> + * rq_lock(rq)
> + * try_to_block_task(p)
> + * psi_dequeue()
> + * [ psi_task_switch() is responsible
> + * for adjusting the PSI flags ]
> + * put_prev_entity(&p->se) try_to_wake_up(p)
> + * # no runnable task on rq->cfs ...
> + * sched_balance_newidle()
> + * raw_spin_rq_unlock(rq) __task_rq_lock(p)
> + * ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> + * __task_rq_unlock(p)
> + * raw_spin_rq_lock(rq)
> + * ...
> + * [ p was re-enqueued or has migrated away ]
> + * ...
> + * psi_task_switch() [Too late!]
> + * raw_spin_rq_unlock(rq)
> + *
> + * In the above case, psi_enqueue() can sees the p->psi_flags state
> + * before it is adjusted to account for dequeue in psi_task_switch(),
> + * or psi_ttwu_dequeue() can clear the p->psi_flags which
> + * psi_task_switch() tries to adjust assuming that the entity has just
> + * finished running.
> + *
> + * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
> + * the combination of TSK_ONCPU and task_cpu(p) to catch the race
> + * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue()
> + * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
> + * flags already for prev on this CPU, skip adjusting flags.
> + */
> + if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
> + return;
> +
> if (prev->pid) {
> int clear = TSK_ONCPU, set = 0;
> bool wake_clock = true;
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..f09903165456 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
> if (flags & ENQUEUE_RESTORE)
> return;
>
> - if (p->se.sched_delayed) {
> + if (p->psi_flags & TSK_ONCPU) {
> + /*
> + * psi_enqueue() can race with psi_task_switch() where
> + * TSK_ONCPU will be still set for the task (see the
> + * comment in psi_task_switch())
> + *
> + * Reaching here with TSK_ONCPU is only possible when
> + * the task is being enqueued on the same CPU. Since
> + * psi_task_switch() has not had the chance to adjust
> + * the flags yet, just clear the TSK_ONCPU which yields
> + * the same result as sleep + wakeup without migration.
> + */
> + SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
> + clear = TSK_ONCPU;
> + } else if (p->se.sched_delayed) {
> /* CPU migration of "sleeping" task */
> SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
> if (p->in_memstall)
>
> base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
Hello there,
Thank you for taking a look at the patch!
On 12/26/2024 4:13 PM, Chengming Zhou wrote:
> Hi,
>
> On 2024/12/26 13:34, K Prateek Nayak wrote:
>> When running hackbench in a cgroup with bandwidth throttling enabled,
>> following PSI splat was observed:
>>
>> psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>>
>> When investigating the series of events leading up to the splat,
>> following sequence was observed:
>> [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
>> ...
>> [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
>> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
>> # CPU8 goes into newidle balance and releases the rq lock
>> ...
>> # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
>> [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
>
> I have a question here, why TSK_ONCPU is not set in psi_flags if
> the task hasn't arrived psi_sched_switch()?
It is set. "psi_flags" is in fact a hex value so the psi_flags is "0x14"
which is (TSK_ONCPU | TSK_RUNNING)
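For reference, the task state bits in include/linux/psi_types.h look
roughly like this (sketch; values shown assume the usual enum ordering):

/* Task state bitmasks: 0x14 == TSK_ONCPU | TSK_RUNNING, and the set=4 in
 * the splat is the waker trying to set TSK_RUNNING again. */
#define TSK_IOWAIT		(1 << NR_IOWAIT)		/* 0x01 */
#define TSK_MEMSTALL		(1 << NR_MEMSTALL)		/* 0x02 */
#define TSK_RUNNING		(1 << NR_RUNNING)		/* 0x04 */
#define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)	/* 0x08 */
#define TSK_ONCPU		(1 << NR_PSI_TASK_COUNTS)	/* 0x10 */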
>
>> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
>> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>>
>> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
>> for the blocked entity, however, the following race is possible with
>> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
>> psi_sched_switch()
>
> Yeah, this race is introduced by delayed dequeue changes.
>
> In the past, a sleep task can't be migrated or enqueued before it's done in __schedule(). (finish_task(prev) clear prev->on_cpu.)
I see __block_task() doing:
smp_store_release(&p->on_rq, 0);
Wouldn't this allow the task to be migrated? P.S. I have not encountered a
case where psi_ttwu_dequeue() has occurred before a psi_sched_switch(), but
looking at the code, I thought it might be possible (I might very well be
wrong).
>
> Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
> to bring it schedulable.
>
> But migration is still impossible, since it's still running on this cpu,
> so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?
>
> (Actually, there we can enqueue_task() for any sleep task, including
> those are not delayed dequeue, if select_task_rq() returns same cpu
> as task_cpu(p) to optimize wakeup latency, maybe need to submit a patch
> later.)
>
>>
>> __schedule()
>> rq_lock(rq)
>> try_to_block_task(p)
>> psi_dequeue()
>> [ psi_task_switch() is responsible
>> for adjusting the PSI flags ]
>> put_prev_entity(&p->se) try_to_wake_up(p)
>> # no runnable task on rq->cfs ...
>> sched_balance_newidle()
>> raw_spin_rq_unlock(rq) __task_rq_lock(p)
>> ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
>> __task_rq_unlock(p)
>> raw_spin_rq_lock(rq)
>> ...
>> [ p was re-enqueued or has migrated away ]
>
> Here ttwu_runnable() call enqueue_task() for delayed dequeue task.
>
> migration can't happen since p->on_cpu is still true.
>
>> ...
>> psi_task_switch() [Too late!]
>> raw_spin_rq_unlock(rq)
>>
>> The wakeup context will see the flags for a running task when the flags
>> should have reflected the task being blocked. Similarly, a migration
>> context in the wakeup path can clear the flags that psi_sched_switch()
>> assumes will be set (TSK_ONCPU / TSK_RUNNING)
>
> In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
> should do nothing at all.
>
> Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
> PSI POV, this task hasn't gone sleep at all, so psi_enqueue() should NOT
> change any state too. (It's not a wakeup or migration from PSI POV.)
There I imagined that newidle_balance() could still pull a task that gets
selected before prev, and with the current implementation,
try_to_block_task() would still have marked prev as blocked, so the flags
would again be inconsistent.
>
> And the current code of "psi_sched_switch(prev, next, block);" looks
> buggy to me too! The "block" value is from try_to_block_task(), then
> pick_next_task() may drop and gain rq lock, so we can't use the stale
> value for psi_sched_switch().
>
> Before we used "task_on_rq_queued(prev)", now we have to also consider
> delayed dequeue case, so it should be:
>
> "!task_on_rq_queued(prev) || prev->se.sched_delayed"
Peter had suggested the current approach as opposed to that on:
https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/
We can perhaps revisit that in light of this.
Again, a lot of the observations in the cover letter are from auditing the
code itself and I might have missed something; any and all comments are
greatly appreciated.
--
Thanks and Regards,
Prateek
>
> Thanks!
>
>>
>> Since the TSK_ONCPU flag has to be modified with the rq lock of
>> task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
>> prevent the race. Specifically:
>>
>> o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
>> psi_enqueue() will only be called with TSK_ONCPU set when the task is
>> being requeued on the same CPU. If the task was migrated,
>> psi_ttwu_dequeue() would have already cleared the PSI flags.
>>
>> psi_enqueue() cannot guarantee that this same task will be picked
>> again when the scheduling CPU returns from newidle balance which is
>> why it clears the TSK_ONCPU to mimic a net result of sleep + wakeup
>> without migration.
>>
>> o When psi_sched_switch() observes that prev's task_cpu() has changes or
>> the TSK_ONCPU flag is not set, a wakeup has raced with the
>> psi_sched_switch() trying to adjust the dequeue flag. If the next is
>> same as the prev, psi_sched_switch() has to now set the TSK_ONCPU flag
>> again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
>> already adjusted the PSI flags and no further changes are required
>> to prev's PSI flags.
>>
>> With the introduction of DELAY_DEQUEUE, the requeue path is considerably
>> shortened and with the addition of bandwidth throttling in the
>> __schedule() path, the race window is large enough to observed this
>> issue.
>>
>> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> This patch is based on tip:sched/core at commit af98d8a36a96
>> ("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")
>>
>> Reproducer for the PSI splat:
>>
>> mkdir /sys/fs/cgroup/test
>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> # Ridiculous limit on SMP to throttle multiple rqs at once
>> echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
>> perf bench sched messaging -t -p -l 100000 -g 16
>>
>> This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) but
>> also on a 32 vCPU VM.
>> ---
>> [..snip..]
On 2024/12/26 19:35, Chengming Zhou wrote:
> On 2024/12/26 19:04, K Prateek Nayak wrote:
>> Hello there,
>>
>> Thank you for taking a look at the patch!
>>
>> On 12/26/2024 4:13 PM, Chengming Zhou wrote:
>>> Hi,
>>>
>>> On 2024/12/26 13:34, K Prateek Nayak wrote:
>>>> When running hackbench in a cgroup with bandwidth throttling enabled,
>>>> following PSI splat was observed:
>>>>
>>>> psi: inconsistent task state! task=1831:hackbench cpu=8
>>>> psi_flags=14 clear=0 set=4
>>>>
>>>> When investigating the series of events leading up to the splat,
>>>> following sequence was observed:
>>>> [008] d..2.: sched_switch: ... ==> next_comm=hackbench
>>>> next_pid=1831 next_prio=120
>>>> ...
>>>> [008] dN.2.: dequeue_entity(task delayed): task=hackbench
>>>> pid=1831 cfs_rq->throttled=0
>>>> [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled
>>>> cfs_rq on CPU8
>>>> # CPU8 goes into newidle balance and releases the rq lock
>>>> ...
>>>> # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
>>>> [015] d..4.: psi_flags_change: psi: task state:
>>>> task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 #
>>>> Splat (cfs_rq->throttled=1)
>>>
>>> I have a question here, why TSK_ONCPU is not set in psi_flags if
>>> the task hasn't arrived psi_sched_switch()?
>>
>> It is set. "psi_flags" is in fact a hex value so the psi_flags is "0x14"
>> which is (TSK_ONCPU | TSK_RUNNING)
>
> Ah, right :)
>
>>
>>>
>>>> [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120
>>>> target_cpu=008 # Task has woken on a throttled hierarchy
>>>> [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831
>>>> prev_prio=120 prev_state=S ==> ...
>>>>
>>>> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
>>>> for the blocked entity, however, the following race is possible with
>>>> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
>>>> psi_sched_switch()
>>>
>>> Yeah, this race is introduced by delayed dequeue changes.
>>>
>>> In the past, a sleep task can't be migrated or enqueued before it's
>>> done in __schedule(). (finish_task(prev) clear prev->on_cpu.)
>>
>> I see __block_task() doing:
>>
>> smp_store_release(&p->on_rq, 0);
>
> Right, p->on_rq is cleared if not delayed dequeue.
>
>>
>> wouldn't this allow the task to be migrated? P.S. I have not
>> encountered a
>
> But p->on_cpu hasn't been cleared until finish_task(prev).
>
> We can see in `can_migrate_task()`, we can't migrate `task_on_cpu()`
> task. Anyway, its code still running on the cpu, we can't migrate it
> to another cpu and run its code concurrently.
>
>> case where psi_ttwu_dequeue() has occurred before a psi_sched_switch()
>> but
>> looking at the code, I thought it might be possible (I might very well be
>> wrong)
>>
>>>
>>> Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
>>> to bring it schedulable.
>>>
>>> But migration is still impossible, since it's still running on this cpu,
>>> so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?
>>>
>>> (Actually, there we can enqueue_task() for any sleep task, including
>>> those are not delayed dequeue, if select_task_rq() returns same cpu
>>> as task_cpu(p) to optimize wakeup latency, maybe need to submit a patch
>>> later.)
>>>
>>>>
>>>> __schedule()
>>>> rq_lock(rq)
>>>> try_to_block_task(p)
>>>> psi_dequeue()
>>>> [ psi_task_switch() is responsible
>>>> for adjusting the PSI flags ]
>>>> put_prev_entity(&p->se) try_to_wake_up(p)
>>>> # no runnable task on rq->cfs ...
>>>> sched_balance_newidle()
>>>> raw_spin_rq_unlock(rq) __task_rq_lock(p)
>>>> ... psi_enqueue()/psi_ttwu_dequeue()
>>>> [Woops!]
>>>> __task_rq_unlock(p)
>>>> raw_spin_rq_lock(rq)
>>>> ...
>>>> [ p was re-enqueued or has migrated away ]
>>>
>>> Here ttwu_runnable() call enqueue_task() for delayed dequeue task.
>>>
>>> migration can't happen since p->on_cpu is still true.
>>>
>>>> ...
>>>> psi_task_switch() [Too late!]
>>>> raw_spin_rq_unlock(rq)
>>>>
>>>> The wakeup context will see the flags for a running task when the flags
>>>> should have reflected the task being blocked. Similarly, a migration
>>>> context in the wakeup path can clear the flags that psi_sched_switch()
>>>> assumes will be set (TSK_ONCPU / TSK_RUNNING)
>>>
>>> In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
>>> should do nothing at all.
>>>
>>> Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
>>> PSI POV, this task hasn't gone sleep at all, so psi_enqueue() should NOT
>>> change any state too. (It's not a wakeup or migration from PSI POV.)
>>
>> There I imagined that newidle_balance() can still pull a task that can
>> be selected before prev and with the current implementation where
>> calling try_to_block_task() would still mark it as blocked and the flags
>> would again be inconsistent.
>
> Yes, it's blocked by try_to_block_task(), just ignored (deferred) by
> PSI. During the time before `psi_sched_switch()`, it maybe enqueued
> by ttwu(), which should be ignored by PSI too.
>
> At last `psi_sched_switch()` called with "sleep == false", just don't
> notice this transient dequeue & enqueue operations. And the flags are
> still consistent.
>
>>
>>>
>>> And the current code of "psi_sched_switch(prev, next, block);" looks
>>> buggy to me too! The "block" value is from try_to_block_task(), then
>>> pick_next_task() may drop and gain rq lock, so we can't use the stale
>>> value for psi_sched_switch().
>>>
>>> Before we used "task_on_rq_queued(prev)", now we have to also consider
>>> delayed dequeue case, so it should be:
>>>
>>> "!task_on_rq_queued(prev) || prev->se.sched_delayed"
>>
>> Peter had suggested the current approach as opposed to that on:
>> https://lore.kernel.org/
>> lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/
>> we can perhaps revisit that in light of this.
>
> Ok, this "block" value is stale when `psi_sched_switch()` called.
>
> So we would see these inconsistent psi flags changes.
>
>>
>> Again, lot of the observations in the cover letter are from auditing the
>> code itself and I might have missed something; any and all comments are
>> greatly appreciated.
>>
>
> Thanks!
Just made a quick fix and it passed testing using your script.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f9..065ac76c47f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
* as a preemption by schedule_debug() and RCU.
*/
bool preempt = sched_mode > SM_NONE;
- bool block = false;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
goto picked;
}
} else if (!preempt && prev_state) {
- block = try_to_block_task(rq, prev, prev_state);
+ try_to_block_task(rq, prev, prev_state);
switch_count = &prev->nvcsw;
}
@@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, block);
+ psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+ prev->se.sched_delayed);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..65efe45fcc77 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
set = TSK_RUNNING;
if (p->in_memstall)
set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
- } else {
+ } else if (!task_on_cpu(task_rq(p), p)) {
/* Wakeup of new or sleeping task */
if (p->in_iowait)
clear |= TSK_IOWAIT;
Hello there,
On 12/26/2024 9:12 PM, Chengming Zhou wrote:
> [..snip..]
>>>>>
>>>>> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
>>>>> for the blocked entity, however, the following race is possible with
>>>>> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
>>>>> psi_sched_switch()
>>>>
>>>> Yeah, this race is introduced by delayed dequeue changes.
>>>>
>>>> In the past, a sleep task can't be migrated or enqueued before it's done in __schedule(). (finish_task(prev) clear prev->on_cpu.)
>>>
>>> I see __block_task() doing:
>>>
>>> smp_store_release(&p->on_rq, 0);
>>
>> Right, p->on_rq is cleared if not delayed dequeue.
>>
>>>
>>> wouldn't this allow the task to be migrated? P.S. I have not encountered a
>>
>> But p->on_cpu hasn't been cleared until finish_task(prev).
I forgot those are two different flags! Thank you for clarifying :)
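For my own reference, the check in question is roughly the following
(sketch of kernel/sched/sched.h):

/* Sketch: __block_task() clears p->on_rq, but p->on_cpu stays set until
 * finish_task() in the next context switch; task_on_cpu() checks the
 * latter and is what can_migrate_task() uses to skip the running task. */
static inline bool task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
	return p->on_cpu;
#else
	return task_current(rq, p);
#endif
}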
>>
>> We can see in `can_migrate_task()`, we can't migrate `task_on_cpu()`
>> task. Anyway, its code still running on the cpu, we can't migrate it
>> to another cpu and run its code concurrently.
>>
>>> case where psi_ttwu_dequeue() has occurred before a psi_sched_switch() but
>>> looking at the code, I thought it might be possible (I might very well be
>>> wrong)
>>>
>>>>
>>>> Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
>>>> to bring it schedulable.
>>>>
>>>> But migration is still impossible, since it's still running on this cpu,
>>>> so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?
>>>>
>>>> (Actually, there we can enqueue_task() for any sleep task, including
>>>> those are not delayed dequeue, if select_task_rq() returns same cpu
>>>> as task_cpu(p) to optimize wakeup latency, maybe need to submit a patch
>>>> later.)
>>>>
>>>>>
>>>>> __schedule()
>>>>> rq_lock(rq)
>>>>> try_to_block_task(p)
>>>>> psi_dequeue()
>>>>> [ psi_task_switch() is responsible
>>>>> for adjusting the PSI flags ]
>>>>> put_prev_entity(&p->se) try_to_wake_up(p)
>>>>> # no runnable task on rq->cfs ...
>>>>> sched_balance_newidle()
>>>>> raw_spin_rq_unlock(rq) __task_rq_lock(p)
>>>>> ... psi_enqueue()/psi_ttwu_dequeue() [Woops!]
>>>>> __task_rq_unlock(p)
>>>>> raw_spin_rq_lock(rq)
>>>>> ...
>>>>> [ p was re-enqueued or has migrated away ]
>>>>
>>>> Here ttwu_runnable() call enqueue_task() for delayed dequeue task.
>>>>
>>>> migration can't happen since p->on_cpu is still true.
>>>>
>>>>> ...
>>>>> psi_task_switch() [Too late!]
>>>>> raw_spin_rq_unlock(rq)
>>>>>
>>>>> The wakeup context will see the flags for a running task when the flags
>>>>> should have reflected the task being blocked. Similarly, a migration
>>>>> context in the wakeup path can clear the flags that psi_sched_switch()
>>>>> assumes will be set (TSK_ONCPU / TSK_RUNNING)
>>>>
>>>> In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
>>>> should do nothing at all.
>>>>
>>>> Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
>>>> PSI POV, this task hasn't gone sleep at all, so psi_enqueue() should NOT
>>>> change any state too. (It's not a wakeup or migration from PSI POV.)
>>>
>>> There I imagined that newidle_balance() can still pull a task that can
>>> be selected before prev and with the current implementation where
>>> calling try_to_block_task() would still mark it as blocked and the flags
>>> would again be inconsistent.
>>
>> Yes, it's blocked by try_to_block_task(), just ignored (deferred) by
>> PSI. During the time before `psi_sched_switch()`, it maybe enqueued
>> by ttwu(), which should be ignored by PSI too.
>>
>> At last `psi_sched_switch()` called with "sleep == false", just don't
>> notice this transient dequeue & enqueue operations. And the flags are
>> still consistent.
>>
>>>
>>>>
>>>> And the current code of "psi_sched_switch(prev, next, block);" looks
>>>> buggy to me too! The "block" value is from try_to_block_task(), then
>>>> pick_next_task() may drop and gain rq lock, so we can't use the stale
>>>> value for psi_sched_switch().
>>>>
>>>> Before we used "task_on_rq_queued(prev)", now we have to also consider
>>>> delayed dequeue case, so it should be:
>>>>
>>>> "!task_on_rq_queued(prev) || prev->se.sched_delayed"
>>>
>>> Peter had suggested the current approach as opposed to that on:
>>> https://lore.kernel.org/ lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/
>>> we can perhaps revisit that in light of this.
>>
>> Ok, this "block" value is stale when `psi_sched_switch()` called.
>>
>> So we would see these inconsistent psi flags changes.
>>
>>>
>>> Again, lot of the observations in the cover letter are from auditing the
>>> code itself and I might have missed something; any and all comments are
>>> greatly appreciated.
>>>
>>
>> Thanks!
>
> Just made a quick fix and tested passed using your script.
Thank you! The diff seems to be malformed as a result of whitespace, but
I was able to test it by recreating the diff. Feel free to add:
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-kprateek.nayak@amd.com/
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
If you can give your sign off, I could add a commit message and send it on
your behalf too.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..065ac76c47f9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
> * as a preemption by schedule_debug() and RCU.
> */
> bool preempt = sched_mode > SM_NONE;
> - bool block = false;
> unsigned long *switch_count;
> unsigned long prev_state;
> struct rq_flags rf;
> @@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
> goto picked;
> }
> } else if (!preempt && prev_state) {
> - block = try_to_block_task(rq, prev, prev_state);
> + try_to_block_task(rq, prev, prev_state);
> switch_count = &prev->nvcsw;
> }
>
> @@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)
>
> migrate_disable_switch(rq, prev);
> psi_account_irqtime(rq, prev, next);
> - psi_sched_switch(prev, next, block);
> + psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
> + prev->se.sched_delayed);
>
> trace_sched_switch(preempt, prev, next, prev_state);
>
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..65efe45fcc77 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
> set = TSK_RUNNING;
> if (p->in_memstall)
> set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
> - } else {
> + } else if (!task_on_cpu(task_rq(p), p)) {
One small nit here.
If the task is on CPU at this point, both set and clear are 0 but
psi_task_change() is still called and I don't see it bailing out if it
doesn't have to adjust any flags.
Can we instead just do an early return if task_on_cpu(task_rq(p), p)
returns true? I've tested that version too and I haven't seen any
splats.
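Something along these lines is what I had in mind (sketch only):

	/*
	 * Sketch: from PSI's POV the task never left the CPU, since
	 * psi_dequeue() deferred the accounting to psi_task_switch(),
	 * so there is nothing to fold in and no reason to call
	 * psi_task_change() with clear == set == 0.
	 */
	if (task_on_cpu(task_rq(p), p))
		return;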
> /* Wakeup of new or sleeping task */
> if (p->in_iowait)
> clear |= TSK_IOWAIT;
--
Thanks and Regards,
Prateek
On 2024/12/27 12:10, K Prateek Nayak wrote:
> Hello there,
>
[...]
>>
>> Just made a quick fix and tested passed using your script.
>
> Thank you! The diff seems to be malformed as a result of whitespaces but
> I was able to test if by recreating the diff. Feel free to add:
>
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-
> kprateek.nayak@amd.com/
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> If you can give your sign off, I could add a commit message and send it on
> your behalf too.
Great, thanks for your time!
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
>
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3e5a6bf587f9..065ac76c47f9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int
>> sched_mode)
>> * as a preemption by schedule_debug() and RCU.
>> */
>> bool preempt = sched_mode > SM_NONE;
>> - bool block = false;
>> unsigned long *switch_count;
>> unsigned long prev_state;
>> struct rq_flags rf;
>> @@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int
>> sched_mode)
>> goto picked;
>> }
>> } else if (!preempt && prev_state) {
>> - block = try_to_block_task(rq, prev, prev_state);
>> + try_to_block_task(rq, prev, prev_state);
>> switch_count = &prev->nvcsw;
>> }
>>
>> @@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int
>> sched_mode)
>>
>> migrate_disable_switch(rq, prev);
>> psi_account_irqtime(rq, prev, next);
>> - psi_sched_switch(prev, next, block);
>> + psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
>> + prev->se.sched_delayed);
>>
>> trace_sched_switch(preempt, prev, next, prev_state);
>>
>> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
>> index 8ee0add5a48a..65efe45fcc77 100644
>> --- a/kernel/sched/stats.h
>> +++ b/kernel/sched/stats.h
>> @@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct
>> *p, int flags)
>> set = TSK_RUNNING;
>> if (p->in_memstall)
>> set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
>> - } else {
>> + } else if (!task_on_cpu(task_rq(p), p)) {
>
> One small nit. here
>
> If the task is on CPU at this point, both set and clear are 0 but
> psi_task_change() is still called and I don't see it bailing out if it
> doesn't have to adjust any flags.
Yes.
>
> Can we instead just do an early return if task_on_cpu(task_rq(p), p)
> returns true? I've tested that version too and I haven't seen any
> splats.
I thought it's good to preserve the current flow that:
if (restore)
return;
if (migrate)
...
else if (wakeup)
...
As for the early return when `task_on_cpu()`, it looks right to me.
Anyway, it's not a migration or a wakeup from the PSI POV.
Thanks!
>
>> /* Wakeup of new or sleeping task */
>> if (p->in_iowait)
>> clear |= TSK_IOWAIT;
>
On 12/27/2024 10:10 AM, Chengming Zhou wrote:
> On 2024/12/27 12:10, K Prateek Nayak wrote:
>> Hello there,
>>
> [...]
>>>
>>> Just made a quick fix and tested passed using your script.
>>
>> Thank you! The diff seems to be malformed as a result of whitespaces but
>> I was able to test if by recreating the diff. Feel free to add:
>>
>> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Closes: https://lore.kernel.org/lkml/20241226053441.1110-1- kprateek.nayak@amd.com/
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>
>> If you can give your sign off, I could add a commit message and send it on
>> your behalf too.
>
> Great, thanks for your time!
>
> Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
Thank you!
>
>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 3e5a6bf587f9..065ac76c47f9 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
>>> * as a preemption by schedule_debug() and RCU.
>>> */
>>> bool preempt = sched_mode > SM_NONE;
>>> - bool block = false;
>>> unsigned long *switch_count;
>>> unsigned long prev_state;
>>> struct rq_flags rf;
>>> @@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
>>> goto picked;
>>> }
>>> } else if (!preempt && prev_state) {
>>> - block = try_to_block_task(rq, prev, prev_state);
>>> + try_to_block_task(rq, prev, prev_state);
>>> switch_count = &prev->nvcsw;
>>> }
>>>
>>> @@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)
>>>
>>> migrate_disable_switch(rq, prev);
>>> psi_account_irqtime(rq, prev, next);
>>> - psi_sched_switch(prev, next, block);
>>> + psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
>>> + prev->se.sched_delayed);
>>>
>>> trace_sched_switch(preempt, prev, next, prev_state);
>>>
>>> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
>>> index 8ee0add5a48a..65efe45fcc77 100644
>>> --- a/kernel/sched/stats.h
>>> +++ b/kernel/sched/stats.h
>>> @@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
>>> set = TSK_RUNNING;
>>> if (p->in_memstall)
>>> set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
>>> - } else {
>>> + } else if (!task_on_cpu(task_rq(p), p)) {
>>
>> One small nit. here
>>
>> If the task is on CPU at this point, both set and clear are 0 but
>> psi_task_change() is still called and I don't see it bailing out if it
>> doesn't have to adjust any flags.
>
> Yes.
>
>>
>> Can we instead just do an early return if task_on_cpu(task_rq(p), p)
>> returns true? I've tested that version too and I haven't seen any
>> splats.
>
> I thought it's good to preserve the current flow that:
>
> if (restore)
> return;
>
> if (migrate)
> ...
> else if (wakeup)
> ...
>
> As for early return when `task_on_cpu()`, it looks right to me.
Thank you for your feedback. I agree the current approach fits the flow
well, but the call to psi_task_change(), which does quite a lot (seq_count
updates, etc.) even though the flags remain the same, seems unnecessary.
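For context, even a no-op call still walks the whole hierarchy under the
per-CPU seqcount; psi_task_change() is roughly (simplified sketch, exact
parameters may differ in the tree):

/* Rough shape of psi_task_change(): with clear == set == 0 it still walks
 * every psi_group up the cgroup hierarchy and runs psi_group_change()
 * under the per-CPU seqcount at each level. */
void psi_task_change(struct task_struct *task, int clear, int set)
{
	int cpu = task_cpu(task);
	struct psi_group *group;

	if (!task->pid)
		return;

	psi_flags_change(task, clear, set);

	group = task_psi_group(task);
	do {
		psi_group_change(group, cpu, clear, set, true);
	} while ((group = group->parent));
}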
> Anyway, it's not a migrate or wakeup from PSI POV.
>
> Thanks!
>
>>
>>> /* Wakeup of new or sleeping task */
>>> if (p->in_iowait)
>>> clear |= TSK_IOWAIT;
>>
--
Thanks and Regards,
Prateek