kernel/sched/core.c | 25 ++++++++++++++++++++++--- kernel/sched/sched.h | 1 + kernel/sched/stats.h | 10 ++++++++++ 3 files changed, 33 insertions(+), 3 deletions(-)
After the introduction of DELAY_DEQUEUE, PSI consistently started warning about inconsistent task state early into the boot. This could be root-caused to three issues that the three patches respectively solve: o PSI signals not being dequeued when the task is blocked, but also delayed since psi_sched_switch() considered "!task_on_rq_queued()" as the task being blocked but a delayed task will remain queued on the runqueue until it is picked again and goes through a full dequeue. o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED in ttwu_runnable(). Since psi_enqueue() only considers (in terms of enqueue flags): (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED) ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue() which only clears TSK_IOWAIT flag on wakeups. o When a delayed task is migrated by the load balancer, the requeue or the wakeup context may be aware that the task has migrated between it blocking and it waking up. This is necessary to be communicated to PSI which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue() to have cleared it during migration. The series correctly communicates the blocked status of a delayed task to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in ttwu_runnable(), re-arranges the psi_enqueue() to be called after a "p->sched_class->enqueue_task()", and notify psi_enqueue() of a migration in delayed state using "p->migration_flags" to maintain the task state consistently. This series was previously posted as one large diff at https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/ and was tested by Johannes. The tags on the diff have been carried to this series. This series is based on: git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core at commit 7266f0a6d3bb ("fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry") Any and all feedback is greatly appreciated. -- K Prateek Nayak (2): sched/core: Add ENQUEUE_WAKEUP flag alongside ENQUEUE_DELAYED sched/core: Indicate a sched_delayed task was migrated before wakeup Peter Zijlstra (1): sched/core: Dequeue PSI signals for blocked tasks that are delayed kernel/sched/core.c | 25 ++++++++++++++++++++++--- kernel/sched/sched.h | 1 + kernel/sched/stats.h | 10 ++++++++++ 3 files changed, 33 insertions(+), 3 deletions(-) -- 2.34.1
On Thu, Oct 10, 2024 at 08:28:35AM +0000, K Prateek Nayak wrote: > After the introduction of DELAY_DEQUEUE, PSI consistently started > warning about inconsistent task state early into the boot. This could be > root-caused to three issues that the three patches respectively solve: > > o PSI signals not being dequeued when the task is blocked, but also > delayed since psi_sched_switch() considered "!task_on_rq_queued()" as > the task being blocked but a delayed task will remain queued on the > runqueue until it is picked again and goes through a full dequeue. > > o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED > in ttwu_runnable(). Since psi_enqueue() only considers (in terms of > enqueue flags): > > (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED) > > ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue() > which only clears TSK_IOWAIT flag on wakeups. > > o When a delayed task is migrated by the load balancer, the requeue or > the wakeup context may be aware that the task has migrated between it > blocking and it waking up. This is necessary to be communicated to PSI > which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue() > to have cleared it during migration. > > The series correctly communicates the blocked status of a delayed task > to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in > ttwu_runnable(), re-arranges the psi_enqueue() to be called after a > "p->sched_class->enqueue_task()", and notify psi_enqueue() of a > migration in delayed state using "p->migration_flags" to maintain the > task state consistently. > > This series was previously posted as one large diff at > https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/ > and was tested by Johannes. The tags on the diff have been carried > to this series. Thanks! I've renamed DELAYED_MIGRATED to MF_DELAYED, and made a note to go rename the MDF_PUSH thing to something consistent. I've stuck then in queue.git sched/urgent along with a few other fixes and I will hopefully push the lot into tip soon.
Hello Peter, On 10/10/2024 4:17 PM, Peter Zijlstra wrote: > On Thu, Oct 10, 2024 at 08:28:35AM +0000, K Prateek Nayak wrote: >> After the introduction of DELAY_DEQUEUE, PSI consistently started >> warning about inconsistent task state early into the boot. This could be >> root-caused to three issues that the three patches respectively solve: >> >> o PSI signals not being dequeued when the task is blocked, but also >> delayed since psi_sched_switch() considered "!task_on_rq_queued()" as >> the task being blocked but a delayed task will remain queued on the >> runqueue until it is picked again and goes through a full dequeue. >> >> o enqueue_task() not using the ENQUEUE_WAKEUP alongside ENQUEUE_DELAYED >> in ttwu_runnable(). Since psi_enqueue() only considers (in terms of >> enqueue flags): >> >> (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED) >> >> ... as a wakeup, the lack of ENQUEUE_WAKEUP can misguide psi_enqueue() >> which only clears TSK_IOWAIT flag on wakeups. >> >> o When a delayed task is migrated by the load balancer, the requeue or >> the wakeup context may be aware that the task has migrated between it >> blocking and it waking up. This is necessary to be communicated to PSI >> which forgoes clearing TSK_IOWAIT since it expects the psi_.*dequeue() >> to have cleared it during migration. >> >> The series correctly communicates the blocked status of a delayed task >> to psi_dequeue(), adds the ENQUEUE_WAKEUP flag during a requeue in >> ttwu_runnable(), re-arranges the psi_enqueue() to be called after a >> "p->sched_class->enqueue_task()", and notify psi_enqueue() of a >> migration in delayed state using "p->migration_flags" to maintain the >> task state consistently. >> >> This series was previously posted as one large diff at >> https://lore.kernel.org/lkml/f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com/ >> and was tested by Johannes. The tags on the diff have been carried >> to this series. > > Thanks! > > I've renamed DELAYED_MIGRATED to MF_DELAYED, and made a note to go > rename the MDF_PUSH thing to something consistent. Ack. DELAYED_MIGRATED was coined in spur of the moment; MF_DELAYED fits the "migration_flags" bill well. > > I've stuck then in queue.git sched/urgent along with a few other fixes > and I will hopefully push the lot into tip soon. Thank you for picking up the patches! Let me know if there are any surprises (there shouldn't be any after the thorough testing by Johannes :) - thank you again for that) -- Thanks and Regards, Prateek
© 2016 - 2024 Red Hat, Inc.