[PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags

Posted by K Prateek Nayak 1 year, 1 month ago
When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the
following sequence was observed:
    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
        ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
    # CPU8 goes into newidle balance and releases the rq lock
        ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():

    __schedule()
	rq_lock(rq)
	    try_to_block_task(p)
		psi_dequeue()
		[ psi_task_switch() is responsible
		  for adjusting the PSI flags ]
	    put_prev_entity(&p->se)			try_to_wake_up(p)
	    # no runnable task on rq->cfs		    ...
	    sched_balance_newidle()
		raw_spin_rq_unlock(rq)			    __task_rq_lock(p)
		...						psi_enqueue()/psi_ttwu_dequeue() [Woops!]
							    __task_rq_unlock(p)
		raw_spin_rq_lock(rq)
	    ...
	    [ p was re-enqueued or has migrated away ]
	    ...
	    psi_task_switch() [Too late!]
	raw_spin_rq_unlock(rq)

The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked. Similarly, a migration
context in the wakeup path can clear the flags that psi_sched_switch()
assumes will be set (TSK_ONCPU / TSK_RUNNING).
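
For reference, the splat comes from the sanity check in
psi_flags_change(), which complains whenever a flag being set is
already set, or a flag being cleared is not set. Roughly (paraphrased;
see kernel/sched/psi.c for the exact logic):

    static void psi_flags_change(struct task_struct *task, int clear, int set)
    {
    	/* Setting an already-set flag / clearing an unset flag is a bug */
    	if ((task->psi_flags & set) ||
    	    (task->psi_flags & clear) != clear)
    		printk_deferred(KERN_ERR "psi: inconsistent task state! ...\n");

    	task->psi_flags &= ~clear;
    	task->psi_flags |= set;
    }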

Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks to
prevent the race. Specifically:

o psi_enqueue() will clear the TSK_ONCPU flag when it finds it set.
  psi_enqueue() will only be called with TSK_ONCPU set when the task is
  being requeued on the same CPU. If the task was migrated,
  psi_ttwu_dequeue() would have already cleared the PSI flags.

  psi_enqueue() cannot guarantee that this same task will be picked
  again when the scheduling CPU returns from newidle balance, which is
  why it clears TSK_ONCPU to mimic a net result of sleep + wakeup
  without migration.

o When psi_sched_switch() observes that prev's task_cpu() has changed or
  the TSK_ONCPU flag is not set, a wakeup has raced with
  psi_sched_switch() trying to adjust the dequeue flags. If next is the
  same as prev, psi_sched_switch() now has to set the TSK_ONCPU flag
  again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() would have
  already adjusted the PSI flags and no further changes are required to
  prev's PSI flags.

With the introduction of DELAY_DEQUEUE, the requeue path is considerably
shortened, and with the addition of bandwidth throttling in the
__schedule() path, the race window is large enough to observe this
issue.

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

  mkdir /sys/fs/cgroup/test
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  # Ridiculous limit on SMP to throttle multiple rqs at once
  echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
  perf bench sched messaging -t -p -l 100000 -g 16

This reproduced reliably on my 3rd Generation EPYC system (2 x 64C/128T)
and also on a 32 vCPU VM.
---
 kernel/sched/core.c  |  7 ++++-
 kernel/sched/psi.c   | 65 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/stats.h | 16 ++++++++++-
 3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
 	rq->last_seen_need_resched_ns = 0;
 #endif
 
+	/*
+	 * PSI might have to deal with the consequences of newidle balance
+	 * possibly dropping the rq lock and prev being requeued and selected.
+	 */
+	psi_sched_switch(prev, next, block);
+
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
 
 		migrate_disable_switch(rq, prev);
 		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
 	struct psi_group *group, *common = NULL;
-	int cpu = task_cpu(prev);
+	int prev_cpu, cpu;
+
+	/* No race between psi_dequeue() and now */
+	if (prev == next && (prev->psi_flags & TSK_ONCPU))
+		return;
+
+	prev_cpu = task_cpu(prev);
+	cpu = smp_processor_id();
 
 	if (next->pid) {
+		/*
+		 * If next == prev but TSK_ONCPU is cleared, the task was
+		 * requeued when newidle balance dropped the rq lock and
+		 * psi_enqueue() cleared the TSK_ONCPU flag.
+		 */
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		group = task_psi_group(next);
 		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+			/*
+			 * Since newidle balance can drop the rq lock (see the next comment),
+			 * there is a possibility of try_to_wake_up() migrating prev away
+			 * before reaching here. Do not find common if the task has migrated.
+			 */
+			if (prev_cpu == cpu &&
+			    (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
 				common = group;
 				break;
 			}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		} while ((group = group->parent));
 	}
 
+	/*
+	 * When a task is blocked, psi_dequeue() leaves the PSI flag
+	 * adjustments to psi_task_switch(); however, there is a possibility of
+	 * rq lock being dropped in the interim and the task being woken up
+	 * again before psi_task_switch() is called leading to psi_enqueue()
+	 * seeing the flags for a running task. Specifically, the following
+	 * scenario is possible:
+	 *
+	 * __schedule()
+	 *   rq_lock(rq)
+	 *     try_to_block_task(p)
+	 *       psi_dequeue()
+	 *        [ psi_task_switch() is responsible
+	 *          for adjusting the PSI flags ]
+	 *     put_prev_entity(&p->se)			try_to_wake_up(p)
+	 *     # no runnable task on rq->cfs		  ...
+	 *     sched_balance_newidle()
+	 *	 raw_spin_rq_unlock(rq)			  __task_rq_lock(p)
+	 *	 ...					  psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+	 *						  __task_rq_unlock(p)
+	 *	 raw_spin_rq_lock(rq)
+	 *     ...
+	 *     [ p was re-enqueued or has migrated away ]
+	 *     ...
+	 *     psi_task_switch() [Too late!]
+	 *   raw_spin_rq_unlock(rq)
+	 *
+	 * In the above case, psi_enqueue() can see the p->psi_flags state
+	 * before it is adjusted to account for dequeue in psi_task_switch(),
+	 * or psi_ttwu_dequeue() can clear the p->psi_flags which
+	 * psi_task_switch() tries to adjust assuming that the entity has just
+	 * finished running.
+	 *
+	 * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
+	 * the combination of TSK_ONCPU and task_cpu(p) to catch the race
+	 * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
+	 * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
+	 * flags already for prev on this CPU, skip adjusting flags.
+	 */
+	if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+		return;
+
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
 		bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
 	if (flags & ENQUEUE_RESTORE)
 		return;
 
-	if (p->se.sched_delayed) {
+	if (p->psi_flags & TSK_ONCPU) {
+		/*
+		 * psi_enqueue() can race with psi_task_switch(), in which
+		 * case TSK_ONCPU will still be set for the task (see the
+		 * comment in psi_task_switch()).
+		 *
+		 * Reaching here with TSK_ONCPU is only possible when
+		 * the task is being enqueued on the same CPU. Since
+		 * psi_task_switch() has not had the chance to adjust
+		 * the flags yet, just clear the TSK_ONCPU which yields
+		 * the same result as sleep + wakeup without migration.
+		 */
+		SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+		clear = TSK_ONCPU;
+	} else if (p->se.sched_delayed) {
 		/* CPU migration of "sleeping" task */
 		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
 		if (p->in_memstall)

base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
-- 
2.34.1
Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by Chengming Zhou 1 year, 1 month ago
Hi,

On 2024/12/26 13:34, K Prateek Nayak wrote:
> When running hackbench in a cgroup with bandwidth throttling enabled,
> following PSI splat was observed:
> 
>      psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
> 
> When investigating the series of events leading up to the splat,
> following sequence was observed:
>      [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
>          ...
>      [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
>      [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
>      # CPU8 goes into newidle balance and releases the rq lock
>          ...
>      # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
>      [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)

I have a question here: why is TSK_ONCPU not set in psi_flags if
the task hasn't arrived at psi_sched_switch()?

>      [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
>      [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
> 
> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
> for the blocked entity, however, the following race is possible with
> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
> psi_sched_switch()

Yeah, this race is introduced by delayed dequeue changes.

In the past, a sleeping task couldn't be migrated or enqueued before
__schedule() was done with it (finish_task(prev) clears prev->on_cpu).

Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
to make it schedulable again.

But migration is still impossible, since it's still running on this cpu,
so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?

(Actually, we could enqueue_task() there for any sleeping task, including
those that are not delayed dequeue, if select_task_rq() returns the same
CPU as task_cpu(p), to optimize wakeup latency; maybe I need to submit a
patch for that later.)
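
For reference, ttwu_runnable() on tip:sched/core looks roughly like
this to me (trimmed, so details may differ):

    static int ttwu_runnable(struct task_struct *p, int wake_flags)
    {
    	struct rq_flags rf;
    	struct rq *rq;
    	int ret = 0;

    	rq = __task_rq_lock(p, &rf);
    	if (task_on_rq_queued(p)) {
    		update_rq_clock(rq);
    		if (p->se.sched_delayed) {
    			/* Requeue in place; this path reaches psi_enqueue() */
    			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
    		}
    		if (!task_on_cpu(rq, p))
    			wakeup_preempt(rq, p, wake_flags);
    		ttwu_do_wakeup(p);
    		ret = 1;
    	}
    	__task_rq_unlock(rq, &rf);

    	return ret;
    }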

> 
>      __schedule()
> 	rq_lock(rq)
> 	    try_to_block_task(p)
> 		psi_dequeue()
> 		[ psi_task_switch() is responsible
> 		  for adjusting the PSI flags ]
> 	    put_prev_entity(&p->se)			try_to_wake_up(p)
> 	    # no runnable task on rq->cfs		    ...
> 	    sched_balance_newidle()
> 		raw_spin_rq_unlock(rq)			    __task_rq_lock(p)
> 		...						psi_enqueue()/psi_ttwu_dequeue() [Woops!]
> 							    __task_rq_unlock(p)
> 		raw_spin_rq_lock(rq)
> 	    ...
> 	    [ p was re-enqueued or has migrated away ]

Here ttwu_runnable() calls enqueue_task() for the delayed dequeue task.

migration can't happen since p->on_cpu is still true.

> 	    ...
> 	    psi_task_switch() [Too late!]
> 	raw_spin_rq_unlock(rq)
> 
> The wakeup context will see the flags for a running task when the flags
> should have reflected the task being blocked. Similarly, a migration
> context in the wakeup path can clear the flags that psi_sched_switch()
> assumes will be set (TSK_ONCPU / TSK_RUNNING)

In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
should do nothing at all.

Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
PSI POV, this task hasn't gone to sleep at all, so psi_enqueue() should
NOT change any state either. (It's not a wakeup or migration from PSI POV.)

And the current code of "psi_sched_switch(prev, next, block);" looks
buggy to me too! The "block" value is from try_to_block_task(), and
pick_next_task() may drop and re-acquire the rq lock, so we can't use
the stale value for psi_sched_switch().

Before we used "task_on_rq_queued(prev)"; now we also have to consider
the delayed dequeue case, so it should be:

"!task_on_rq_queued(prev) || prev->se.sched_delayed"

Thanks!

> [..snip..]
Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by K Prateek Nayak 1 year, 1 month ago
Hello there,

Thank you for taking a look at the patch!

On 12/26/2024 4:13 PM, Chengming Zhou wrote:
> Hi,
> 
> On 2024/12/26 13:34, K Prateek Nayak wrote:
>> When running hackbench in a cgroup with bandwidth throttling enabled,
>> following PSI splat was observed:
>>
>>      psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>>
>> When investigating the series of events leading up to the splat,
>> following sequence was observed:
>>      [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
>>          ...
>>      [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
>>      [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
>>      # CPU8 goes into newidle balance and releases the rq lock
>>          ...
>>      # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
>>      [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
> 
> I have a question here: why is TSK_ONCPU not set in psi_flags if
> the task hasn't arrived at psi_sched_switch()?

It is set. "psi_flags" is in fact printed as a hex value, so psi_flags is
"0x14", which is (TSK_ONCPU | TSK_RUNNING).
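
For reference, the task state bits as I read them from
include/linux/psi_types.h (worth double-checking against your tree):

    TSK_IOWAIT           0x01
    TSK_MEMSTALL         0x02
    TSK_RUNNING          0x04
    TSK_MEMSTALL_RUNNING 0x08
    TSK_ONCPU            0x10

    0x14 == TSK_ONCPU | TSK_RUNNING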

> 
>>      [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
>>      [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...
>>
>> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
>> for the blocked entity, however, the following race is possible with
>> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
>> psi_sched_switch()
> 
> Yeah, this race is introduced by delayed dequeue changes.
> 
> In the past, a sleep task can't be migrated or enqueued before it's done in __schedule(). (finish_task(prev) clear prev->on_cpu.)

I see __block_task() doing:

     smp_store_release(&p->on_rq, 0);

wouldn't this allow the task to be migrated? P.S. I have not encountered a
case where psi_ttwu_dequeue() has occurred before a psi_sched_switch(), but
looking at the code, I thought it might be possible (I might very well be
wrong).

> 
> Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
> to bring it schedulable.
> 
> But migration is still impossible, since it's still running on this cpu,
> so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?
> 
> (Actually, there we can enqueue_task() for any sleep task, including
> those are not delayed dequeue, if select_task_rq() returns same cpu
> as task_cpu(p) to optimize wakeup latency, maybe need to submit a patch
> later.)
> 
>>
>>      __schedule()
>>          rq_lock(rq)
>>              try_to_block_task(p)
>>                  psi_dequeue()
>>                  [ psi_task_switch() is responsible
>>                    for adjusting the PSI flags ]
>>              put_prev_entity(&p->se)              try_to_wake_up(p)
>>              # no runnable task on rq->cfs            ...
>>              sched_balance_newidle()
>>                  raw_spin_rq_unlock(rq)               __task_rq_lock(p)
>>                  ...                                  psi_enqueue()/psi_ttwu_dequeue() [Woops!]
>>                                                       __task_rq_unlock(p)
>>                  raw_spin_rq_lock(rq)
>>              ...
>>              [ p was re-enqueued or has migrated away ]
> 
> Here ttwu_runnable() call enqueue_task() for delayed dequeue task.
> 
> migration can't happen since p->on_cpu is still true.
> 
>>              ...
>>              psi_task_switch() [Too late!]
>>          raw_spin_rq_unlock(rq)
>>
>> The wakeup context will see the flags for a running task when the flags
>> should have reflected the task being blocked. Similarly, a migration
>> context in the wakeup path can clear the flags that psi_sched_switch()
>> assumes will be set (TSK_ONCPU / TSK_RUNNING)
> 
> In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
> should do nothing at all.
> 
> Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
> PSI POV, this task hasn't gone sleep at all, so psi_enqueue() should NOT
> change any state too. (It's not a wakeup or migration from PSI POV.)

There I imagined that newidle_balance() could still pull a task that gets
selected before prev; with the current implementation, try_to_block_task()
would still have marked prev as blocked and the flags would again be
inconsistent.

> 
> And the current code of "psi_sched_switch(prev, next, block);" looks
> buggy to me too! The "block" value is from try_to_block_task(), then
> pick_next_task() may drop and gain rq lock, so we can't use the stale
> value for psi_sched_switch().
> 
> Before we used "task_on_rq_queued(prev)", now we have to also consider
> delayed dequeue case, so it should be:
> 
> "!task_on_rq_queued(prev) || prev->se.sched_delayed"

Peter had suggested the current approach as opposed to that on:
https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/
We can perhaps revisit that in light of this.

Again, a lot of the observations in the cover letter are from auditing the
code itself and I might have missed something; any and all comments are
greatly appreciated.

-- 
Thanks and Regards,
Prateek

Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by Chengming Zhou 1 year, 1 month ago
On 2024/12/26 19:04, K Prateek Nayak wrote:
> Hello there,
> 
> Thank you for taking a look at the patch!
> 
> On 12/26/2024 4:13 PM, Chengming Zhou wrote:
>> Hi,
>>
>> On 2024/12/26 13:34, K Prateek Nayak wrote:
>>>> When running hackbench in a cgroup with bandwidth throttling enabled,
>>>> following PSI splat was observed:
>>>>
>>>>      psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4
>>>>
>>>> When investigating the series of events leading up to the splat,
>>>> following sequence was observed:
>>>>      [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
>>>>          ...
>>>>      [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
>>>>      [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
>>>>      # CPU8 goes into newidle balance and releases the rq lock
>>>>          ...
>>>>      # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
>>>>      [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
>>
>>> I have a question here: why is TSK_ONCPU not set in psi_flags if
>>> the task hasn't arrived at psi_sched_switch()?
> 
> It is set. "psi_flags" is in fact printed as a hex value, so psi_flags
> is "0x14", which is (TSK_ONCPU | TSK_RUNNING).

Ah, right :)

> 
>>
>>>      [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 
>>> target_cpu=008 # Task has woken on a throttled hierarchy
>>>      [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 
>>> prev_prio=120 prev_state=S ==> ...
>>>
>>> psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
>>> for the blocked entity, however, the following race is possible with
>>> psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
>>> psi_sched_switch()
>>
>> Yeah, this race is introduced by delayed dequeue changes.
>>
>> In the past, a sleep task can't be migrated or enqueued before it's done in __schedule(). (finish_task(prev) clear prev->on_cpu.)
> 
> I see __block_task() doing:
> 
>      smp_store_release(&p->on_rq, 0);

Right, p->on_rq is cleared if not delayed dequeue.

> 
> wouldn't this allow the task to be migrated? P.S. I have not encountered a

But p->on_cpu hasn't been cleared until finish_task(prev).

We can see in `can_migrate_task()` that we can't migrate a `task_on_cpu()`
task. Anyway, its code is still running on the CPU; we can't migrate it
to another CPU and run its code concurrently.
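
i.e. roughly the following, from my reading of kernel/sched/fair.c and
kernel/sched/sched.h (trimmed, so double-check the details):

    /* kernel/sched/fair.c: load balancing skips the currently running task */
    static int can_migrate_task(struct task_struct *p, struct lb_env *env)
    {
    	...
    	if (task_on_cpu(env->src_rq, p)) {
    		schedstat_inc(p->stats.nr_failed_migrations_running);
    		return 0;
    	}
    	...
    }

    /* kernel/sched/sched.h: on SMP this is exactly p->on_cpu */
    static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
    {
    #ifdef CONFIG_SMP
    	return p->on_cpu;
    #else
    	return task_current(rq, p);
    #endif
    }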

> case where psi_ttwu_dequeue() has occurred before a psi_sched_switch() but
> looking at the code, I thought it might be possible (I might very well be
> wrong)
> 
>>
>> Now, ttwu_runnable() can call enqueue_task() on the delayed dequeue task
>> to bring it schedulable.
>>
>> But migration is still impossible, since it's still running on this cpu,
>> so no psi_ttwu_dequeue(), only psi_enqueue() can happen, right?
>>
>> (Actually, there we can enqueue_task() for any sleep task, including
>> those are not delayed dequeue, if select_task_rq() returns same cpu
>> as task_cpu(p) to optimize wakeup latency, maybe need to submit a patch
>> later.)
>>
>>>
>>>      __schedule()
>>>     rq_lock(rq)
>>>         try_to_block_task(p)
>>>         psi_dequeue()
>>>         [ psi_task_switch() is responsible
>>>           for adjusting the PSI flags ]
>>>         put_prev_entity(&p->se)            try_to_wake_up(p)
>>>         # no runnable task on rq->cfs            ...
>>>         sched_balance_newidle()
>>>         raw_spin_rq_unlock(rq)                __task_rq_lock(p)
>>>         ...                        psi_enqueue()/psi_ttwu_dequeue() 
>>> [Woops!]
>>>                                 __task_rq_unlock(p)
>>>         raw_spin_rq_lock(rq)
>>>         ...
>>>         [ p was re-enqueued or has migrated away ]
>>
>> Here ttwu_runnable() call enqueue_task() for delayed dequeue task.
>>
>> migration can't happen since p->on_cpu is still true.
>>
>>>         ...
>>>         psi_task_switch() [Too late!]
>>>     raw_spin_rq_unlock(rq)
>>>
>>> The wakeup context will see the flags for a running task when the flags
>>> should have reflected the task being blocked. Similarly, a migration
>>> context in the wakeup path can clear the flags that psi_sched_switch()
>>> assumes will be set (TSK_ONCPU / TSK_RUNNING)
>>
>> In this ttwu_runnable() -> enqueue_task() case, I think psi_enqueue()
>> should do nothing at all.
>>
>> Why? Because psi_dequeue() is deferred to psi_sched_switch(), so from
>> PSI POV, this task hasn't gone sleep at all, so psi_enqueue() should NOT
>> change any state too. (It's not a wakeup or migration from PSI POV.)
> 
> There I imagined that newidle_balance() can still pull a task that can
> be selected before prev and with the current implementation where
> calling try_to_block_task() would still mark it as blocked and the flags
> would again be inconsistent.

Yes, it's blocked by try_to_block_task(), just ignored (deferred) by
PSI. During the time before `psi_sched_switch()`, it may be enqueued
by ttwu(), which should be ignored by PSI too.

In the end, `psi_sched_switch()` is called with "sleep == false" and
simply doesn't notice these transient dequeue & enqueue operations.
And the flags are still consistent.

> 
>>
>> And the current code of "psi_sched_switch(prev, next, block);" looks
>> buggy to me too! The "block" value is from try_to_block_task(), then
>> pick_next_task() may drop and gain rq lock, so we can't use the stale
>> value for psi_sched_switch().
>>
>> Before we used "task_on_rq_queued(prev)", now we have to also consider
>> delayed dequeue case, so it should be:
>>
>> "!task_on_rq_queued(prev) || prev->se.sched_delayed"
> 
> Peter had suggested the current approach as opposed to that on:
> https://lore.kernel.org/ 
> lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/
> we can perhaps revisit that in light of this.

Ok, this "block" value is stale when `psi_sched_switch()` is called.

So we would see these inconsistent PSI flag changes.

> 
> Again, lot of the observations in the cover letter are from auditing the
> code itself and I might have missed something; any and all comments are
> greatly appreciated.
> 

Thanks!
Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by Chengming Zhou 1 year, 1 month ago
On 2024/12/26 19:35, Chengming Zhou wrote:
> [..snip..]

Just made a quick fix, and the test passed using your script.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f9..065ac76c47f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
          * as a preemption by schedule_debug() and RCU.
          */
         bool preempt = sched_mode > SM_NONE;
-       bool block = false;
         unsigned long *switch_count;
         unsigned long prev_state;
         struct rq_flags rf;
@@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
                         goto picked;
                 }
         } else if (!preempt && prev_state) {
-               block = try_to_block_task(rq, prev, prev_state);
+               try_to_block_task(rq, prev, prev_state);
                 switch_count = &prev->nvcsw;
         }

@@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)

                 migrate_disable_switch(rq, prev);
                 psi_account_irqtime(rq, prev, next);
-               psi_sched_switch(prev, next, block);
+               psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+                                               prev->se.sched_delayed);

                 trace_sched_switch(preempt, prev, next, prev_state);

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..65efe45fcc77 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
                 set = TSK_RUNNING;
                 if (p->in_memstall)
                         set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
-       } else {
+       } else if (!task_on_cpu(task_rq(p), p)) {
                 /* Wakeup of new or sleeping task */
                 if (p->in_iowait)
                         clear |= TSK_IOWAIT;
Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by K Prateek Nayak 1 year, 1 month ago
Hello there,

On 12/26/2024 9:12 PM, Chengming Zhou wrote:
> [..snip..]
>>>> Yeah, this race is introduced by delayed dequeue changes.
>>>>
>>>> In the past, a sleep task can't be migrated or enqueued before it's done in __schedule(). (finish_task(prev) clear prev->on_cpu.)
>>>
>>> I see __block_task() doing:
>>>
>>>      smp_store_release(&p->on_rq, 0);
>>
>> Right, p->on_rq is cleared if not delayed dequeue.
>>
>>>
>>> wouldn't this allow the task to be migrated? P.S. I have not encountered a
>>
>> But p->on_cpu hasn't been cleared until finish_task(prev).

I forgot those are two different flags! Thank you for clarifying :)

>> [..snip..]
> 
> Just made a quick fix and tested passed using your script.

Thank you! The diff seems to be malformed as a result of whitespace
damage, but I was able to test it by recreating the diff. Feel free to add:

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-kprateek.nayak@amd.com/
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

If you can give your sign off, I could add a commit message and send it on
your behalf too.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..065ac76c47f9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6641,7 +6641,6 @@ static void __sched notrace __schedule(int sched_mode)
>           * as a preemption by schedule_debug() and RCU.
>           */
>          bool preempt = sched_mode > SM_NONE;
> -       bool block = false;
>          unsigned long *switch_count;
>          unsigned long prev_state;
>          struct rq_flags rf;
> @@ -6702,7 +6701,7 @@ static void __sched notrace __schedule(int sched_mode)
>                          goto picked;
>                  }
>          } else if (!preempt && prev_state) {
> -               block = try_to_block_task(rq, prev, prev_state);
> +               try_to_block_task(rq, prev, prev_state);
>                  switch_count = &prev->nvcsw;
>          }
> 
> @@ -6748,7 +6747,8 @@ static void __sched notrace __schedule(int sched_mode)
> 
>                  migrate_disable_switch(rq, prev);
>                  psi_account_irqtime(rq, prev, next);
> -               psi_sched_switch(prev, next, block);
> +               psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
> +                                               prev->se.sched_delayed);
> 
>                  trace_sched_switch(preempt, prev, next, prev_state);
> 
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 8ee0add5a48a..65efe45fcc77 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -150,7 +150,7 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
>                  set = TSK_RUNNING;
>                  if (p->in_memstall)
>                          set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
> -       } else {
> +       } else if (!task_on_cpu(task_rq(p), p)) {

One small nit here.

If the task is on CPU at this point, both set and clear are 0 but
psi_task_change() is still called and I don't see it bailing out if it
doesn't have to adjust any flags.

Can we instead just do an early return if task_on_cpu(task_rq(p), p)
returns true? I've tested that version too and I haven't seen any
splats.

>                  /* Wakeup of new or sleeping task */
>                  if (p->in_iowait)
>                          clear |= TSK_IOWAIT;
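
For illustration, the early-return variant I tested looks roughly like
this on top of your diff (hand-applied, so treat the hunk context as
approximate):

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -150,7 +150,15 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
 		set = TSK_RUNNING;
 		if (p->in_memstall)
 			set |= TSK_MEMSTALL | TSK_MEMSTALL_RUNNING;
-	} else {
+	} else if (task_on_cpu(task_rq(p), p)) {
+		/*
+		 * From PSI's POV the task never went to sleep: psi_dequeue()
+		 * was deferred and psi_sched_switch() will find it queued
+		 * again, so skip psi_task_change() entirely rather than
+		 * calling it with clear == set == 0.
+		 */
+		return;
+	} else {
 		/* Wakeup of new or sleeping task */
 		if (p->in_iowait)
 			clear |= TSK_IOWAIT;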

-- 
Thanks and Regards,
Prateek

Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by Chengming Zhou 1 year, 1 month ago
On 2024/12/27 12:10, K Prateek Nayak wrote:
> Hello there,
> 
[...]
>>
>> Just made a quick fix and tested passed using your script.
> 
> Thank you! The diff seems to be malformed as a result of whitespaces but
> I was able to test if by recreating the diff. Feel free to add:
> 
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-kprateek.nayak@amd.com/
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> If you can give your sign off, I could add a commit message and send it on
> your behalf too.

Great, thanks for your time!

Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>

> 
>>
>> [..snip..]
>> -       } else {
>> +       } else if (!task_on_cpu(task_rq(p), p)) {
> 
> One small nit. here
> 
> If the task is on CPU at this point, both set and clear are 0 but
> psi_task_change() is still called and I don't see it bailing out if it
> doesn't have to adjust any flags.

Yes.

> 
> Can we instead just do an early return if task_on_cpu(task_rq(p), p)
> returns true? I've tested that version too and I haven't seen any
> splats.

I thought it's good to preserve the current flow:

if (restore)
	return;

if (migrate)
	...
else if (wakeup)
	...

As for the early return when `task_on_cpu()`, it looks right to me.
Anyway, it's not a migration or wakeup from PSI POV.

Thanks!

> 
>>                  /* Wakeup of new or sleeping task */
>>                  if (p->in_iowait)
>>                          clear |= TSK_IOWAIT;
> 
Re: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Posted by K Prateek Nayak 1 year, 1 month ago
On 12/27/2024 10:10 AM, Chengming Zhou wrote:
> On 2024/12/27 12:10, K Prateek Nayak wrote:
>> Hello there,
>>
> [...]
>>>
>>> Just made a quick fix and tested passed using your script.
>>
>> Thank you! The diff seems to be malformed as a result of whitespaces but
>> I was able to test if by recreating the diff. Feel free to add:
>>
>> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Closes: https://lore.kernel.org/lkml/20241226053441.1110-1-kprateek.nayak@amd.com/
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>
>> If you can give your sign off, I could add a commit message and send it on
>> your behalf too.
> 
> Great, thanks for your time!
> 
> Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>

Thank you!

> 
>>
>>> [..snip..]
>>> -       } else {
>>> +       } else if (!task_on_cpu(task_rq(p), p)) {
>>
>> One small nit. here
>>
>> If the task is on CPU at this point, both set and clear are 0 but
>> psi_task_change() is still called and I don't see it bailing out if it
>> doesn't have to adjust any flags.
> 
> Yes.
> 
>>
>> Can we instead just do an early return if task_on_cpu(task_rq(p), p)
>> returns true? I've tested that version too and I haven't seen any
>> splats.
> 
> I thought it's good to preserve the current flow that:
> 
> if (restore)
>      return;
> 
> if (migrate)
>      ...
> else if (wakeup)
>      ...
> 
> As for early return when `task_on_cpu()`, it looks right to me.

Thank you for your feedback. I agree the current approach fits the flow
well, but the call to psi_task_change(), which does a whole lot
(seq_count updates, etc.) even though the flags remain the same, seems
unnecessary.
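
For reference, what still runs in that case is roughly the following
(paraphrasing kernel/sched/psi.c from memory, so details may be
slightly off):

    void psi_task_change(struct task_struct *task, int clear, int set)
    {
    	struct psi_group *group;

    	if (!task->pid)
    		return;

    	psi_flags_change(task, clear, set);

    	/*
    	 * Walk every psi_group from the task's cgroup up to the root,
    	 * doing per-CPU accounting under a seqcount write section for
    	 * each group, even when clear == set == 0.
    	 */
    	group = task_psi_group(task);
    	do {
    		psi_group_change(group, task_cpu(task), clear, set, true);
    	} while ((group = group->parent));
    }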

> Anyway, it's not a migrate or wakeup from PSI POV.
> 
> Thanks!
> 
>>
>>>                  /* Wakeup of new or sleeping task */
>>>                  if (p->in_iowait)
>>>                          clear |= TSK_IOWAIT;
>>

-- 
Thanks and Regards,
Prateek