When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
an inconsistent task state warning was noticed soon after boot,
similar to:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
On analysis, the following sequence of events was found to be the
cause of the splat:
o Blocked task is retained on the runqueue.
o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
signals for the task.
o The task blocks later via proxy_deactivate() but psi_dequeue() doesn't
adjust the PSI flags since DEQUEUE_SLEEP is set, expecting
psi_sched_switch() to fix up the signals (see the simplified sketch
after this list).
o The blocked task is woken up with the PSI state still reflecting that
the task is runnable (TSK_RUNNING), leading to the splat.
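For reference, the deferral in question looks roughly like the below -
a condensed sketch of the psi_dequeue() path in kernel/sched/stats.h
(simplified; the exact signature and flag handling differ between trees):

  /* Condensed sketch, not the exact upstream code */
  static inline void psi_dequeue(struct task_struct *p, int flags)
  {
          if (static_branch_likely(&psi_disabled))
                  return;

          /*
           * A voluntary sleep is a dequeue followed by a task switch:
           * psi_sched_switch() is expected to clear TSK_RUNNING (and
           * TSK_IOWAIT) for the sleeping task, so nothing is done here
           * when DEQUEUE_SLEEP is set ...
           */
          if (flags & DEQUEUE_SLEEP)
                  return;

          /* ... only a non-sleep dequeue clears the PSI state here */
          psi_task_change(p, p->psi_flags, 0);
  }

With a blocked donor retained on the rq, the switch-out happens while the
task is still queued (so psi_sched_switch() keeps the runnable signals),
and the later proxy_deactivate() dequeue passes DEQUEUE_SLEEP - so nothing
ever clears TSK_RUNNING for the now-blocked task.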
Simply tracking proxy_deactivate() is not enough since the task's
blocked_on relationship can be cleared remotely without acquiring the
runqueue lock, which can force a blocked task to run before a wakeup -
pick_next_task() picks the blocked donor and, since the blocked_on
relationship was cleared remotely, task_is_blocked() returns false,
leading to the task being run on the CPU.
If the task blocks again before it is woken up, psi_sched_switch() will
try to clear the runnable signals (TSK_RUNNING) unconditionally, leading
to a different splat similar to:
psi: inconsistent task state! task=... cpu=... psi_flags=10 clear=14 set=0
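For readers decoding the two warnings (the flags are printed in hex, which
is what the "0x" prefix from patch 2 makes explicit), and assuming the
current task state bit layout in include/linux/psi_types.h:

  /* Assumed bit layout from include/linux/psi_types.h */
  #define TSK_IOWAIT              0x01
  #define TSK_MEMSTALL            0x02
  #define TSK_RUNNING             0x04
  #define TSK_MEMSTALL_RUNNING    0x08
  #define TSK_ONCPU               0x10

  /*
   * First splat:  psi_flags=0x4 (TSK_RUNNING), clear=0x0, set=0x4
   *   -> TSK_RUNNING is set on a task PSI already accounts as runnable.
   *
   * Second splat: psi_flags=0x10 (TSK_ONCPU),
   *               clear=0x14 (TSK_ONCPU | TSK_RUNNING), set=0x0
   *   -> TSK_RUNNING is cleared from a task PSI never marked runnable.
   */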
To get around this, track the complete lifecycle of a blocked donor
right from delaying the deactivation to the wakeup. When in the
blocked/donor state, PSI will consider these tasks similar to delayed
tasks - blocked but migratable.
When ttwu_runnable() finally wakes up the task, or when the donor is
deactivated via proxy_deactivate(), the proxy indicator is cleared to
show that the task is now either fully blocked or fully runnable.
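In rough terms, the intended accounting looks like the sketch below. This
is only an illustration of the idea and not the actual patches: the
psi_proxy_retain()/psi_proxy_release() helpers are hypothetical names,
while p->sched_proxy is the per-task bit discussed later in this thread.

  /* Hypothetical sketch of the intent, not the real implementation */
  static inline void psi_proxy_retain(struct task_struct *p)
  {
          p->sched_proxy = 1;                     /* donor kept on the rq */
          psi_task_change(p, TSK_RUNNING, 0);     /* blocked but migratable */
  }

  static inline void psi_proxy_release(struct task_struct *p, bool runnable)
  {
          p->sched_proxy = 0;
          if (runnable)                           /* ttwu_runnable() path */
                  psi_task_change(p, 0, TSK_RUNNING);
          /*
           * proxy_deactivate() path: the task stays blocked and the
           * regular dequeue accounting covers the rest.
           */
  }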
Patches 1 and 2 are cleanups to make life slightly easier when auditing
the implementation and inspecting the debug logs. Patches 3 to 5
implement the tracking of donor states along with a couple of fixes on
top.
The series was tested on top of tip:sched/core for a while, running
sched-messaging without observing any inconsistent task state warnings,
and should apply cleanly on top of:
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit 33cf66d88306 ("sched/fair: Proportional newidle balance").
---
K Prateek Nayak (5):
sched/psi: Make psi stubs consistent for !CONFIG_PSI
sched/psi: Prepend "0x" to format specifiers when printing PSI flags
sched/core: Track blocked tasks retained on rq for proxy
sched/core: Block proxy task on pick when blocked_on is cleared before
wakeup
sched/psi: Fix PSI signals of blocked tasks retained for proxy
include/linux/sched.h | 4 +++
kernel/sched/core.c | 59 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/psi.c | 4 +--
kernel/sched/sched.h | 2 ++
kernel/sched/stats.h | 6 ++---
5 files changed, 68 insertions(+), 7 deletions(-)
base-commit: 33cf66d88306663d16e4759e9d24766b0aaa2e17
--
2.34.1
On Mon, Nov 17, 2025 at 10:56 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
> an inconsistent task state warning was noticed soon after boot,
> similar to:
>
> psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
>
> On analysis, the following sequence of events was found to be the
> cause of the splat:
>
> o Blocked task is retained on the runqueue.
> o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
>   signals for the task.
> o The task blocks later via proxy_deactivate() but psi_dequeue() doesn't
>   adjust the PSI flags since DEQUEUE_SLEEP is set, expecting
>   psi_sched_switch() to fix up the signals.
> o The blocked task is woken up with the PSI state still reflecting that
>   the task is runnable (TSK_RUNNING), leading to the splat.

Hey, K Prateek!
  Thanks for chasing this down and sending this series out!

I'm still getting my head around the description above (it's been a
while since I last looked at the PSI code), but early on I often hit
PSI splats, and I thought I had addressed it with the patch here:
https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565

And with that I've not run across any warnings since.

Now, I hadn't tripped over the issue recently with the subset of the
full series I've been pushing upstream, and as I most easily ran into
it with the sleeping owner enqueuing feature, I was holding the fix
back for those changes. But I realize that, unfortunately, CONFIG_PSI
at some point got disabled in my test defconfig, so I've not had the
opportunity to trip it, and sure enough I can trivially see it booting
with the current upstream code.

Applying that fix does seem to avoid the warnings in my trivial
testing, but again I've not dug through the logic in a while, so you
may have a better sense of the inadequacies of that fix.

If it looks reasonable to you, I'll rework the commit message so it
isn't so focused on the sleeping-owner-enqueuing case and submit it.

I'll have to spend some time here looking more at your proposed
solution. On the initial glance, I do fret a little about the
task->sched_proxy bit overlapping a bit in meaning with the
task->blocked_on value.

thanks
-john
Hello John,

On 11/18/2025 6:15 AM, John Stultz wrote:
> On Mon, Nov 17, 2025 at 10:56 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> When booting into a kernel with CONFIG_SCHED_PROXY_EXEC and CONFIG_PSI,
>> an inconsistent task state warning was noticed soon after boot,
>> similar to:
>>
>> psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
>>
>> On analysis, the following sequence of events was found to be the
>> cause of the splat:
>>
>> o Blocked task is retained on the runqueue.
>> o psi_sched_switch() sees task_on_rq_queued() and retains the runnable
>>   signals for the task.
>> o The task blocks later via proxy_deactivate() but psi_dequeue() doesn't
>>   adjust the PSI flags since DEQUEUE_SLEEP is set, expecting
>>   psi_sched_switch() to fix up the signals.
>> o The blocked task is woken up with the PSI state still reflecting that
>>   the task is runnable (TSK_RUNNING), leading to the splat.
>
> Hey, K Prateek!
>   Thanks for chasing this down and sending this series out!
>
> I'm still getting my head around the description above (it's been a
> while since I last looked at the PSI code), but early on I often hit
> PSI splats, and I thought I had addressed it with the patch here:
> https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565

Oooo! Let me go test that.

>
> And with that I've not run across any warnings since.
>
> Now, I hadn't tripped over the issue recently with the subset of the
> full series I've been pushing upstream, and as I most easily ran into
> it with the sleeping owner enqueuing feature, I was holding the fix
> back for those changes. But I realize that, unfortunately, CONFIG_PSI
> at some point got disabled in my test defconfig, so I've not had the
> opportunity to trip it, and sure enough I can trivially see it booting
> with the current upstream code.

I hit this on tip:sched/core when looking at the recent sched_yield()
changes. Maybe the "blocked_on" serialization with the proxy migration
will make this all go away :)

>
> Applying that fix does seem to avoid the warnings in my trivial
> testing, but again I've not dug through the logic in a while, so you
> may have a better sense of the inadequacies of that fix.
>
> If it looks reasonable to you, I'll rework the commit message so it
> isn't so focused on the sleeping-owner-enqueuing case and submit it.

That would be great! And it seems to be a lot simpler than the stuff
I'm trying to do. I'll give it a spin and get back to you. Thank you
again for pointing to the fix.

>
> I'll have to spend some time here looking more at your proposed
> solution. On the initial glance, I do fret a little about the
> task->sched_proxy bit overlapping a bit in meaning with the
> task->blocked_on value.

Ack! I'm pretty sure that with the blocked_on locking we'll not have
these "interesting" situations, but I posted the RFC out just in case
we needed something in the interim - turns out it's a solved problem :)

One last thing: it'll be good to get some clarification on how to treat
the blocked tasks retained on the runqueue for PSI - a quick look at
your fix suggests we still consider them runnable (TSK_RUNNING) from a
PSI standpoint - is this ideal, or should PSI consider these tasks
blocked?

--
Thanks and Regards,
Prateek
On Mon, Nov 17, 2025 at 5:39 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 11/18/2025 6:15 AM, John Stultz wrote:
> > I'm still getting my head around the description above (it's been a
> > while since I last looked at the PSI code), but early on I often hit
> > PSI splats, and I thought I had addressed it with the patch here:
> > https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565
>
> Oooo! Let me go test that.
>
> >
> > And with that I've not run across any warnings since.
> >
> > Now, I hadn't tripped over the issue recently with the subset of the
> > full series I've been pushing upstream, and as I most easily ran into
> > it with the sleeping owner enqueuing feature, I was holding the fix
> > back for those changes. But I realize that, unfortunately, CONFIG_PSI
> > at some point got disabled in my test defconfig, so I've not had the
> > opportunity to trip it, and sure enough I can trivially see it booting
> > with the current upstream code.
>
> I hit this on tip:sched/core when looking at the recent sched_yield()
> changes. Maybe the "blocked_on" serialization with the proxy migration
> will make this all go away :)
>
> >
> > Applying that fix does seem to avoid the warnings in my trivial
> > testing, but again I've not dug through the logic in a while, so you
> > may have a better sense of the inadequacies of that fix.
> >
> > If it looks reasonable to you, I'll rework the commit message so it
> > isn't so focused on the sleeping-owner-enqueuing case and submit it.
>
> That would be great! And it seems to be a lot simpler than the stuff
> I'm trying to do. I'll give it a spin and get back to you. Thank you
> again for pointing to the fix.
>
> >
> > I'll have to spend some time here looking more at your proposed
> > solution. On the initial glance, I do fret a little about the
> > task->sched_proxy bit overlapping a bit in meaning with the
> > task->blocked_on value.
>
> Ack! I'm pretty sure that with the blocked_on locking we'll not have
> these "interesting" situations, but I posted the RFC out just in case
> we needed something in the interim - turns out it's a solved problem :)
>
> One last thing: it'll be good to get some clarification on how to treat
> the blocked tasks retained on the runqueue for PSI - a quick look at
> your fix suggests we still consider them runnable (TSK_RUNNING) from a
> PSI standpoint - is this ideal, or should PSI consider these tasks
> blocked?

So my default way of thinking about mutex-blocked tasks with proxy is
that they are equivalent to runnable. They can be selected by
pick_next_task(), and they are charged for the time they donate to the
lock owner that runs as the proxy.

To conceptualize things with ProxyExec, I often imagine the
mutex-blocked task as being in "optimistic spin" mode waiting for the
mutex, where we'd just run the task and let it spin, instead of
blocking the task (when the lock owner isn't already running). Then we
just have the optimization of, instead of just wasting time spinning,
running the lock owner to release the lock.

So, I need to further refresh myself with more of the subtleties of
PSI, but to me considering it TSK_RUNNING seems intuitive.

There are maybe some transient cases, like where the blocked task is
on one RQ and the lock holder is on another: until the blocked task is
selected (and then proxy-migrated to boost the task on the other cpu),
if it were very far back in the runqueue it could be contributing what
could be seen as "false pressure" on that RQ. So maybe I need to think
a bit more about that. But it still is a task that wants to run to
boost the lock owner, so I'm not sure how different it is in the PSI
view compared to transient runqueue imbalances.

thanks
-john
Hello John,

On 11/18/2025 9:56 AM, John Stultz wrote:
> On Mon, Nov 17, 2025 at 5:39 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 11/18/2025 6:15 AM, John Stultz wrote:
>>> I'm still getting my head around the description above (it's been a
>>> while since I last looked at the PSI code), but early on I often hit
>>> PSI splats, and I thought I had addressed it with the patch here:
>>> https://github.com/johnstultz-work/linux-dev/commit/f60923a6176b3778a8fc9b9b0bbe4953153ce565
>>
>> Oooo! Let me go test that.

Seems like that solution works too on top of the current tip:sched/core.
I think you can send it out as a standalone patch for inclusion while we
hash out the donor migration bits (and blocked owner, and rwsem!).

>>
>>>
>>> And with that I've not run across any warnings since.
>>>
>>> Now, I hadn't tripped over the issue recently with the subset of the
>>> full series I've been pushing upstream, and as I most easily ran into
>>> it with the sleeping owner enqueuing feature, I was holding the fix
>>> back for those changes. But I realize that, unfortunately, CONFIG_PSI
>>> at some point got disabled in my test defconfig, so I've not had the
>>> opportunity to trip it, and sure enough I can trivially see it booting
>>> with the current upstream code.
>>
>> I hit this on tip:sched/core when looking at the recent sched_yield()
>> changes. Maybe the "blocked_on" serialization with the proxy migration
>> will make this all go away :)
>>
>>>
>>> Applying that fix does seem to avoid the warnings in my trivial
>>> testing, but again I've not dug through the logic in a while, so you
>>> may have a better sense of the inadequacies of that fix.
>>>
>>> If it looks reasonable to you, I'll rework the commit message so it
>>> isn't so focused on the sleeping-owner-enqueuing case and submit it.
>>
>> That would be great! And it seems to be a lot simpler than the stuff
>> I'm trying to do. I'll give it a spin and get back to you. Thank you
>> again for pointing to the fix.
>>
>>>
>>> I'll have to spend some time here looking more at your proposed
>>> solution. On the initial glance, I do fret a little about the
>>> task->sched_proxy bit overlapping a bit in meaning with the
>>> task->blocked_on value.
>>
>> Ack! I'm pretty sure that with the blocked_on locking we'll not have
>> these "interesting" situations, but I posted the RFC out just in case
>> we needed something in the interim - turns out it's a solved problem :)
>>
>> One last thing: it'll be good to get some clarification on how to treat
>> the blocked tasks retained on the runqueue for PSI - a quick look at
>> your fix suggests we still consider them runnable (TSK_RUNNING) from a
>> PSI standpoint - is this ideal, or should PSI consider these tasks
>> blocked?
>
> So my default way of thinking about mutex-blocked tasks with proxy is
> that they are equivalent to runnable. They can be selected by
> pick_next_task(), and they are charged for the time they donate to the
> lock owner that runs as the proxy.
>
> To conceptualize things with ProxyExec, I often imagine the
> mutex-blocked task as being in "optimistic spin" mode waiting for the
> mutex, where we'd just run the task and let it spin, instead of
> blocking the task (when the lock owner isn't already running). Then we
> just have the optimization of, instead of just wasting time spinning,
> running the lock owner to release the lock.

I think I can see it now. I generally considered them the other way
around - as blocked tasks retained just for the vruntime context.
I'll try changing my perspective to match yours when looking at proxy :)

As for the fix in your tree, feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

>
> So, I need to further refresh myself with more of the subtleties of
> PSI, but to me considering it TSK_RUNNING seems intuitive.
>
> There are maybe some transient cases, like where the blocked task is
> on one RQ and the lock holder is on another: until the blocked task is
> selected (and then proxy-migrated to boost the task on the other cpu),
> if it were very far back in the runqueue it could be contributing what
> could be seen as "false pressure" on that RQ. So maybe I need to think
> a bit more about that. But it still is a task that wants to run to
> boost the lock owner, so I'm not sure how different it is in the PSI
> view compared to transient runqueue imbalances.

I think Johannes has a better understanding of how these signals are
used in the field, so I'll defer to him.

--
Thanks and Regards,
Prateek