[PATCH 0/3] sched/core: Fix PSI inconsistent task state splats with DELAY_DEQUEUE

K Prateek Nayak posted 3 patches 1 month, 2 weeks ago
kernel/sched/core.c  | 25 ++++++++++++++++++++++---
kernel/sched/sched.h |  1 +
kernel/sched/stats.h | 10 ++++++++++
3 files changed, 33 insertions(+), 3 deletions(-)
[PATCH 0/3] sched/core: Fix PSI inconsistent task state splats with DELAY_DEQUEUE
Posted by K Prateek Nayak 1 month, 2 weeks ago
After the introduction of DELAY_DEQUEUE, PSI consistently started
warning about inconsistent task state early into the boot. This could be
root-caused to three issues that the three patches respectively solve:

o PSI signals not being dequeued when the task is blocked, but also
  delayed since psi_sched_switch() considered "!task_on_rq_queued()" as
  the task being blocked but a delayed task will remain queued on the
  runqueue until it is picked again and goes through a full dequeue.

o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED
  in ttwu_runnable(). Since psi_enqueue() only considers (in terms of
  enqueue flags):

    (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)

  ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue()
  which only clears TSK_IOWAIT flag on wakeups.

o When a delayed task is migrated by the load balancer, the requeue or
  the wakeup context may be aware that the task has migrated between it
  blocking and it waking up. This is necessary to be communicated to PSI
  which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue()
  to have cleared it during migration.

The series correctly communicates the blocked status of a delayed task
to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in
ttwu_runnable(), re-arranges the psi_enqueue() to be called after a
"p->sched_class->enqueue_task()", and notify psi_enqueue() of a
migration in delayed state using "p->migration_flags" to maintain the
task state consistently.

This series was previously posted as one large diff at
https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/
and was tested by Johannes. The tags on the diff have been carried
to this series.

This series is based on:

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit 7266f0a6d3bb ("fs/bcachefs: Fix
__wait_on_freeing_inode() definition of waitqueue entry")

Any and all feedback is greatly appreciated.

--
K Prateek Nayak (2):
  sched/core: Add ENQUEUE_WAKEUP flag alongside ENQUEUE_DELAYED
  sched/core: Indicate a sched_delayed task was migrated before wakeup

Peter Zijlstra (1):
  sched/core: Dequeue PSI signals for blocked tasks that are delayed

 kernel/sched/core.c  | 25 ++++++++++++++++++++++---
 kernel/sched/sched.h |  1 +
 kernel/sched/stats.h | 10 ++++++++++
 3 files changed, 33 insertions(+), 3 deletions(-)

-- 
2.34.1
Re: [PATCH 0/3] sched/core: Fix PSI inconsistent task state splats with DELAY_DEQUEUE
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Thu, Oct 10, 2024 at 08:28:35AM +0000, K Prateek Nayak wrote:
> After the introduction of DELAY_DEQUEUE, PSI consistently started
> warning about inconsistent task state early into the boot. This could be
> root-caused to three issues that the three patches respectively solve:
> 
> o PSI signals not being dequeued when the task is blocked, but also
>   delayed since psi_sched_switch() considered "!task_on_rq_queued()" as
>   the task being blocked but a delayed task will remain queued on the
>   runqueue until it is picked again and goes through a full dequeue.
> 
> o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED
>   in ttwu_runnable(). Since psi_enqueue() only considers (in terms of
>   enqueue flags):
> 
>     (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)
> 
>   ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue()
>   which only clears TSK_IOWAIT flag on wakeups.
> 
> o When a delayed task is migrated by the load balancer, the requeue or
>   the wakeup context may be aware that the task has migrated between it
>   blocking and it waking up. This is necessary to be communicated to PSI
>   which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue()
>   to have cleared it during migration.
> 
> The series correctly communicates the blocked status of a delayed task
> to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in
> ttwu_runnable(), re-arranges the psi_enqueue() to be called after a
> "p->sched_class->enqueue_task()", and notify psi_enqueue() of a
> migration in delayed state using "p->migration_flags" to maintain the
> task state consistently.
> 
> This series was previously posted as one large diff at
> https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/
> and was tested by Johannes. The tags on the diff have been carried
> to this series.

Thanks!

I've renamed DELAYED_MIGRATED to MF_DELAYED, and made a note to go
rename the MDF_PUSH thing to something consistent.

I've stuck then in queue.git sched/urgent along with a few other fixes
and I will hopefully push the lot into tip soon.
Re: [PATCH 0/3] sched/core: Fix PSI inconsistent task state splats with DELAY_DEQUEUE
Posted by K Prateek Nayak 1 month, 2 weeks ago
Hello Peter,

On 10/10/2024 4:17 PM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2024 at 08:28:35AM +0000, K Prateek Nayak wrote:
>> After the introduction of DELAY_DEQUEUE, PSI consistently started
>> warning about inconsistent task state early into the boot. This could be
>> root-caused to three issues that the three patches respectively solve:
>>
>> o PSI signals not being dequeued when the task is blocked, but also
>>    delayed since psi_sched_switch() considered "!task_on_rq_queued()" as
>>    the task being blocked but a delayed task will remain queued on the
>>    runqueue until it is picked again and goes through a full dequeue.
>>
>> o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED
>>    in ttwu_runnable(). Since psi_enqueue() only considers (in terms of
>>    enqueue flags):
>>
>>      (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)
>>
>>    ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue()
>>    which only clears TSK_IOWAIT flag on wakeups.
>>
>> o When a delayed task is migrated by the load balancer, the requeue or
>>    the wakeup context may be aware that the task has migrated between it
>>    blocking and it waking up. This is necessary to be communicated to PSI
>>    which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue()
>>    to have cleared it during migration.
>>
>> The series correctly communicates the blocked status of a delayed task
>> to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in
>> ttwu_runnable(), re-arranges the psi_enqueue() to be called after a
>> "p->sched_class->enqueue_task()", and notify psi_enqueue() of a
>> migration in delayed state using "p->migration_flags" to maintain the
>> task state consistently.
>>
>> This series was previously posted as one large diff at
>> https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/
>> and was tested by Johannes. The tags on the diff have been carried
>> to this series.
> 
> Thanks!
> 
> I've renamed DELAYED_MIGRATED to MF_DELAYED, and made a note to go
> rename the MDF_PUSH thing to something consistent.

Ack. DELAYED_MIGRATED was coined in spur of the moment; MF_DELAYED fits
the "migration_flags" bill well.

> 
> I've stuck then in queue.git sched/urgent along with a few other fixes
> and I will hopefully push the lot into tip soon.

Thank you for picking up the patches! Let me know if there are any
surprises (there shouldn't be any after the thorough testing by
Johannes :) - thank you again for that)

--
Thanks and Regards,
Prateek