[RFC PATCH 0/5] sched/psi: Fix PSI accounting with proxy execution

Posted by K Prateek Nayak 2 weeks ago
When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
an inconsistent task state warning was noticed soon after boot, similar
to:

    psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4

On analysis, the following sequence of events was found to be the cause
of the splat:

o Blocked task is retained on the runqueue.
o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
  signals for the task.
o Task blocks later via proxy_deactivate() but psi_dequeue() doesn't
  adjust the PSI flags since DEQUEUE_SLEEP is set, expecting
  psi_sched_switch() to fix the signals.
o The blocked task is woken up with the PSI state still reflecting that
  the task is runnable (TSK_RUNNING), leading to the splat.


Simply tracking proxy_deactivate() is not enough since the task's
blocked_on relationship can be cleared remotely without acquiring the
runqueue lock, which can force a blocked task to run before a wakeup -
pick_next_task() picks the blocked donor, and since the blocked_on
relationship was cleared remotely, task_is_blocked() returns false,
leading to the task being run on the CPU.

If the task blocks again before it is woken up, psi_sched_switch() will
try to clear the runnable signals (TSK_RUNNING) unconditionally, leading
to a different splat similar to:

    psi: inconsistent task state! task=... cpu=... psi_flags=10 clear=14 set=0


To get around this, track the complete lifecycle of a blocked donor
right from delaying the deactivation to the wakeup. When in the
blocked/donor state, PSI will consider these tasks similar to delayed
tasks - blocked but migratable.

When ttwu_runnable() finally wakes up the task, or the donor is
deactivated via proxy_deactivate(), the proxy indicator is cleared to
show that the task is now either fully blocked or fully runnable.

Patches 1 and 2 are cleanups to make life slightly easier when auditing
the implementation and inspecting the debug logs. Patches 3 to 5
implement the tracking of donor states, with a couple of fixes on top.

The series was tested on top of tip:sched/core for a while running
sched-messaging without observing any inconsistent task state warnings,
and should apply cleanly on top of:

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit 33cf66d88306 ("sched/fair: Proportional newidle balance").

---
K Prateek Nayak (5):
  sched/psi: Make psi stubs consistent for !CONFIG_PSI
  sched/psi: Prepend "0x" to format specifiers when printing PSI flags
  sched/core: Track blocked tasks retained on rq for proxy
  sched/core: Block proxy task on pick when blocked_on is cleared before
    wakeup
  sched/psi: Fix PSI signals of blocked tasks retained for proxy

 include/linux/sched.h |  4 +++
 kernel/sched/core.c   | 59 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/psi.c    |  4 +--
 kernel/sched/sched.h  |  2 ++
 kernel/sched/stats.h  |  6 ++---
 5 files changed, 68 insertions(+), 7 deletions(-)


base-commit: 33cf66d88306663d16e4759e9d24766b0aaa2e17
-- 
2.34.1
Re: [RFC PATCH 0/5] sched/psi: Fix PSI accounting with proxy execution
Posted by John Stultz 2 weeks ago
On Mon, Nov 17, 2025 at 10:56 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
> a inconsistent task state warning was noticed soon after the boot
> similar to:
>
>     psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
>
> On analysis, the following sequence of event was found to be the cause
> of the splat:
>
> o Blocked task is retained on the runqueue.
> o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
>   signals for the task.
> o Tasks blocks later via proxy_deactivate() but psi_dequeue() doesn't
>   adjust the PSI flags since DEQUEUE_SLEEP is set expecting
>   psi_sched_switch() to fix the signals.
> o The blocked task is woken up with the PSI state still reflecting that
>   the task is runnable (TSK_RUNNING) leading to the splat.

Hey, K Prateek!
  Thanks for chasing this down and sending this series out!

I'm still getting my head around the description above (it's been
a while since I last looked at the PSI code), but early on I often hit
PSI splats, and I thought I had addressed it with the patch here:
  https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565

And with that I've not run across any warnings since.

Now, I hadn't tripped over the issue recently with the subset of the
full series I've been pushing upstream, and since I most easily ran into
it with the sleeping-owner enqueuing feature, I was holding the fix
back for those changes. But I realize unfortunately CONFIG_PSI at some
point got disabled in my test defconfig, so I haven't had the
opportunity to trip it, and sure enough I can trivially see it booting
with the current upstream code.

Applying that fix does seem to avoid the warnings in my trivial
testing, but again I've not dug through the logic in a while, so you
may have a better sense of the inadequacies of that fix.

If it looks reasonable to you, I'll rework the commit message so it
isn't so focused on the sleeping-owner-enqueuing case and submit it.

I'll have to spend some time here looking more at your proposed
solution. On initial glance, I do fret a little about the
task->sched_proxy bit overlapping in meaning with the
task->blocked_on value.

thanks
-john
Re: [RFC PATCH 0/5] sched/psi: Fix PSI accounting with proxy execution
Posted by K Prateek Nayak 2 weeks ago
Hello John,

On 11/18/2025 6:15 AM, John Stultz wrote:
> On Mon, Nov 17, 2025 at 10:56 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
>> a inconsistent task state warning was noticed soon after the boot
>> similar to:
>>
>>     psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
>>
>> On analysis, the following sequence of event was found to be the cause
>> of the splat:
>>
>> o Blocked task is retained on the runqueue.
>> o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
>>   signals for the task.
>> o Tasks blocks later via proxy_deactivate() but psi_dequeue() doesn't
>>   adjust the PSI flags since DEQUEUE_SLEEP is set expecting
>>   psi_sched_switch() to fix the signals.
>> o The blocked task is woken up with the PSI state still reflecting that
>>   the task is runnable (TSK_RUNNING) leading to the splat.
> 
> Hey, K Prateek!
>   Thanks for chasing this down and sending this series out!
> 
> I'm still getting my head around the description above (its been
> awhile since I last looked at the PSI code), but early on I often hit
> PSI splats, and I thought I had addressed it with the patch here:
>   https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565

Oooo! Let me go test that.

> 
> And with that I've not run across any warnings since.
> 
> Now, I hadn't tripped over the issue recently with the subset of the
> full series I've been pushing upstream, and as I most easily ran into
> it with the sleeping owner enqueuing feature I was holding the fix
> back for those changes. But I realize unfortunately CONFIG_PSI at some
> point got disabled in my test defconfig, so I've not had the
> opportunity to trip it, and sure enough I can trivially see it booting
> with the current upstream code.

I hit this on tip:sched/core when looking at the recent sched_yield()
changes. Maybe the "blocked_on" serialization with the proxy migration
will make this all go away :)

> 
> Applying that fix does seem to avoid the warnings in my trivial
> testing, but again I've not dug through the logic in awhile, so you
> may have a better sense of the inadequacies of that fix.
> 
> If it looks reasonable to you, I'll rework the commit message so it
> isn't so focused on the sleeping-owner-enquing case and submit it.

That would be great! And it seems to be a lot simpler than the
stuff I'm trying to do. I'll give it a spin and get back to you.
Thank you again for pointing to the fix.

> 
> I'll have to spend some time here looking more at your proposed
> solution. On the initial glance, I do fret a little with the
> task->sched_proxy bit overlapping a bit in meaning with the
> task->blocked_on value.

Ack! I'm pretty sure with the blocked_on locking we'll not have these
"interesting" situations, but I posted the RFC out just in case we
needed something in the interim - turns out it's a solved problem :)

One last thing: it'll be good to get some clarification on how to treat
the blocked tasks retained on the runqueue for PSI - a quick look at your
fix suggests we still consider them runnable (TSK_RUNNING) from the PSI
standpoint - is this ideal, or should PSI consider these tasks blocked?

-- 
Thanks and Regards,
Prateek

Re: [RFC PATCH 0/5] sched/psi: Fix PSI accounting with proxy execution
Posted by John Stultz 1 week, 6 days ago
On Mon, Nov 17, 2025 at 5:39 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/18/2025 6:15 AM, John Stultz wrote:
> > I'm still getting my head around the description above (its been
> > awhile since I last looked at the PSI code), but early on I often hit
> > PSI splats, and I thought I had addressed it with the patch here:
> >   https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565
>
> Oooo! Let me go test that.
>
> >
> > And with that I've not run across any warnings since.
> >
> > Now, I hadn't tripped over the issue recently with the subset of the
> > full series I've been pushing upstream, and as I most easily ran into
> > it with the sleeping owner enqueuing feature I was holding the fix
> > back for those changes. But I realize unfortunately CONFIG_PSI at some
> > point got disabled in my test defconfig, so I've not had the
> > opportunity to trip it, and sure enough I can trivially see it booting
> > with the current upstream code.
>
> I hit this on tip:sched/core when looking at the recent sched_yield()
> changes. Maybe the "blocked_on" serialization with the proxy migration
> will make this all go away :)
>
> >
> > Applying that fix does seem to avoid the warnings in my trivial
> > testing, but again I've not dug through the logic in awhile, so you
> > may have a better sense of the inadequacies of that fix.
> >
> > If it looks reasonable to you, I'll rework the commit message so it
> > isn't so focused on the sleeping-owner-enquing case and submit it.
>
> That would be great! And it seems to be a lot more simpler than the
> the stuff I'm trying to do. I'll give it a spin and get back to you.
> Thank you again for pointing to the fix.
>
> >
> > I'll have to spend some time here looking more at your proposed
> > solution. On the initial glance, I do fret a little with the
> > task->sched_proxy bit overlapping a bit in meaning with the
> > task->blocked_on value.
>
> Ack! I'm pretty sure with the blocked_on locking we'll not have these
> "interesting" situations but I posted the RFC out just in case we
> needed something in the interim but turns out its a solved problem :)
>
> On last thing, it'll be good to get some clarification on how to treat
> the blocked tasks retained on the runqueue for PSI - quick look at your
> fix suggests we still consider them runnable (TSK_RUNNING) from PSI
> standpoint - is this ideal or should PSI consider these tasks blocked?

So my default way of thinking about mutex-blocked tasks with proxy is
that they are equivalent to runnable. They can be selected by
pick_next_task(), and they are charged for the time they donate to the
lock-owner that runs as the proxy.
To conceptualize things with ProxyExec, I often imagine the
mutex-blocked task as being in "optimistic spin" mode waiting for the
mutex, where we'd just run the task and let it spin (even when the lock
owner isn't already running), instead of blocking the task. Then we
just have the optimization that, instead of wasting time spinning,
we run the lock owner so it can release the lock.

So, I need to further refresh myself with more of the subtleties of
PSI, but to me considering it TSK_RUNNING seems intuitive.

There are maybe some transient cases, like where the blocked task is
on one RQ and the lock holder is on another, and thus until the
blocked task is selected (and then proxy-migrated to boost the task on
the other CPU), if it were very far back in the runqueue it
could be contributing what could be seen as "false pressure" on that
RQ. So maybe I need to think a bit more about that. But it is still a
task that wants to run to boost the lock owner, so I'm not sure how
different it is in the PSI view compared to transient runqueue
imbalances.

thanks
-john
Re: [RFC PATCH 0/5] sched/psi: Fix PSI accounting with proxy execution
Posted by K Prateek Nayak 1 week, 6 days ago
Hello John,

On 11/18/2025 9:56 AM, John Stultz wrote:
> On Mon, Nov 17, 2025 at 5:39 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 11/18/2025 6:15 AM, John Stultz wrote:
>>> I'm still getting my head around the description above (its been
>>> awhile since I last looked at the PSI code), but early on I often hit
>>> PSI splats, and I thought I had addressed it with the patch here:
>>>   https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565
>>
>> Oooo! Let me go test that.

Seems like that solution works too on top of current tip:sched/core.
I think you can send it out as a standalone patch for inclusion while
we hash out the donor migration bits (and blocked owner, and rwsem!).

>>
>>>
>>> And with that I've not run across any warnings since.
>>>
>>> Now, I hadn't tripped over the issue recently with the subset of the
>>> full series I've been pushing upstream, and as I most easily ran into
>>> it with the sleeping owner enqueuing feature I was holding the fix
>>> back for those changes. But I realize unfortunately CONFIG_PSI at some
>>> point got disabled in my test defconfig, so I've not had the
>>> opportunity to trip it, and sure enough I can trivially see it booting
>>> with the current upstream code.
>>
>> I hit this on tip:sched/core when looking at the recent sched_yield()
>> changes. Maybe the "blocked_on" serialization with the proxy migration
>> will make this all go away :)
>>
>>>
>>> Applying that fix does seem to avoid the warnings in my trivial
>>> testing, but again I've not dug through the logic in awhile, so you
>>> may have a better sense of the inadequacies of that fix.
>>>
>>> If it looks reasonable to you, I'll rework the commit message so it
>>> isn't so focused on the sleeping-owner-enquing case and submit it.
>>
>> That would be great! And it seems to be a lot more simpler than the
>> the stuff I'm trying to do. I'll give it a spin and get back to you.
>> Thank you again for pointing to the fix.
>>
>>>
>>> I'll have to spend some time here looking more at your proposed
>>> solution. On the initial glance, I do fret a little with the
>>> task->sched_proxy bit overlapping a bit in meaning with the
>>> task->blocked_on value.
>>
>> Ack! I'm pretty sure with the blocked_on locking we'll not have these
>> "interesting" situations but I posted the RFC out just in case we
>> needed something in the interim but turns out its a solved problem :)
>>
>> On last thing, it'll be good to get some clarification on how to treat
>> the blocked tasks retained on the runqueue for PSI - quick look at your
>> fix suggests we still consider them runnable (TSK_RUNNING) from PSI
>> standpoint - is this ideal or should PSI consider these tasks blocked?
> 
> So my default way of thinking about mutex-blocked tasks with proxy is
> that they are equivalent to runnable. They can be selected by
> pick_next_task(), and they are charged for the time they donate to the
> lock-owner that runs as the proxy.
> To conceptualize things with ProxyExec, I often imagine the
> mutex-blocked task as being in "optimistic spin" mode waiting for the
> mutex, where we'd just run the task and let it spin, instead of
> blocking the task (when the lock owner isn't already running). Then we
> just have the optimization of instead of just wasting time spinning,
> we run the lock owner to release the lock.

I think I can see it now. I generally considered them the other
way around - as blocked tasks retained just for the vruntime context.
I'll try changing my perspective to match yours when looking at
proxy :)

As for the fix in your tree, feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> 
> So, I need to further refresh myself with more of the subtleties of
> PSI, but to me considering it TSK_RUNNING seems intuitive.
> 
> There are maybe some transient cases, like where the blocked task is
> on one RQ, and the lock holder is on another, and thus until the
> blocked task is selected (and then proxy-migrated to boost the task on
> the other cpu), where if it were very far back in the runqueue it
> could be contributing what could be seen as "false pressure" on that
> RQ.  So maybe I need to think a bit more about that. But it still is a
> task that wants to run to boost the lock owner, so I'm not sure how
> different it is in the PSI view compared to transient runqueue
> imbalances.

I think Johannes has a better understanding of how these signals are
used in the field so I'll defer to him.

-- 
Thanks and Regards,
Prateek