[RFC PATCH] sched/core: Stash task priority after dequeue and put_prev_task() in sched_change_begin()

Posted by K Prateek Nayak 1 month ago

When running the amd-pstate driver on a PREEMPT_RT kernel on a shared
memory system (Zen3 and prior), the following splat was observed,
triggered by the WARN_ON_ONCE() in rq_pin_lock():

    ------------[ cut here ]------------
    WARNING: kernel/sched/sched.h:1807 at __schedule+0x122/0x17c0, CPU#8: swapper/0/1
    Modules linked in:
    CPU: 8 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-rc1-rt-amd-pstate+ #153 PREEMPT_{RT,(full)}
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:__schedule+0x122/0x17c0
    Code: 3...
    RSP: 0018:ffffd2f8800e7a50 EFLAGS: 00010082
    RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000005
    RDX: ffff89f2fd41d1e0 RSI: 0000000000000000 RDI: ffff89f2fd432480
    RBP: ffffd2f8800e7af8 R08: 0000000000000643 R09: 000000037328de2f
    R10: 0000000373168f59 R11: 000000037328de2f R12: 0000000000000001
    R13: ffff89f2fd432480 R14: 0000000000000008 R15: ffff89b4d9072810
    FS:  0000000000000000(0000) GS:ffff89f34fee5000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000807dc4a001 CR4: 0000000000f70ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? psi_group_change+0x1ff/0x460
     ? srso_alias_return_thunk+0x5/0xfbef5
     preempt_schedule+0x41/0x60
     preempt_schedule_thunk+0x16/0x30
     try_to_wake_up+0x341/0x7c0
     autoremove_wake_function+0x12/0x40
     __wake_up_common+0x78/0xa0
     __wake_up+0x31/0x50
     send_pcc_cmd+0x133/0x310
     cppc_set_reg_val+0x10e/0x220
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? amd_pstate_init_boost_support+0x33/0xb0
     amd_pstate_cpu_init+0x159/0x270
     ? srso_alias_return_thunk+0x5/0xfbef5
     cpufreq_online+0x6b0/0xd90
     ? rtlock_slowlock_locked+0xce1/0xd30
     cpufreq_add_dev+0xa9/0xd0
     subsys_interface_register+0x10b/0x120
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? __pfx_amd_pstate_init+0x10/0x10
     cpufreq_register_driver+0x1a7/0x370
     amd_pstate_register_driver.part.0+0x2a/0xa0
     amd_pstate_init+0xe3/0x3a0
     ? __pfx_amd_pstate_init+0x10/0x10
     do_one_initcall+0x47/0x310
     kernel_init_freeable+0x33c/0x500
     ? __pfx_kernel_init+0x10/0x10
     kernel_init+0x1b/0x1f0
     ? __pfx_kernel_init+0x10/0x10
     ret_from_fork+0x222/0x280
     ? __pfx_kernel_init+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    ---[ end trace 0000000000000000 ]---

Inspecting the set of events that led to the warning being triggered
showed the following:

    systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!

    systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
    systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
    systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
    systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
    systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
    systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
    systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!

    systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
    systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
    systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
    systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
    systemd-1  [008] dN.31 ...: sched_change_end: End!

    systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
    systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!

1. sched_change_begin() from guard(sched_change) in
   do_set_cpus_allowed() stashes the priority, which, for a deadline
   task, is "p->dl.deadline".
2. The dequeue of the deadline task replenishes the deadline.
3. The task is enqueued back after the guard's scope ends and, since no
   *_CLASS flags are set, sched_change_end() calls
   dl_sched_class->prio_changed(), which compares the deadline.
4. Since the deadline was moved on dequeue, prio_changed_dl() sees the
   value differ from the stashed value and queues a balance pull
   callback.
5. do_set_cpus_allowed() finishes and drops the rq_lock without calling
   do_balance_callbacks().
6. Grabbing the rq_lock at the subsequent __schedule() triggers the
   warning since the balance pull callback was never executed before
   dropping the lock.

Since dequeuing a deadline task can push its deadline, stash the task
priority towards the end of sched_change_begin(), after dequeue_task()
and put_prev_task() have run.

Any modification to the priority within the sched_change guard's scope
is still detected, since sched_change_end() supplies the priority
stashed at the end of the constructor's execution as the old priority to
sched_class->prio_changed().
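
The ordering problem can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not kernel code:
"struct toy_task", toy_dequeue() and toy_change() are hypothetical
stand-ins, the dequeue is assumed to do nothing except replenish the
deadline, and the guarded scope is assumed to leave the priority alone.
The deadline values are the ones from the trace above; the
"replenish_delta" is simply their difference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a deadline task; not the kernel's task_struct. */
struct toy_task {
	uint64_t deadline;        /* plays the role of p->dl.deadline */
	uint64_t replenish_delta; /* how far a replenish pushes the deadline */
	bool     exhausted;       /* pending accounting will replenish */
};

/* Models step 2: accounting done during dequeue may move the deadline. */
static void toy_dequeue(struct toy_task *p)
{
	if (p->exhausted) {
		p->deadline += p->replenish_delta;
		p->exhausted = false;
	}
}

/*
 * Models a change sequence whose guarded scope does not touch the
 * priority. Returns true if the end of the sequence would report a
 * priority change (the comparison prio_changed() is assumed to make).
 */
static bool toy_change(struct toy_task *p, bool stash_after_dequeue)
{
	uint64_t stashed = 0;

	if (!stash_after_dequeue)
		stashed = p->deadline;	/* original ordering: stash first */

	toy_dequeue(p);

	if (stash_after_dequeue)
		stashed = p->deadline;	/* proposed ordering: stash last */

	/* ... the guarded scope would run here and changes nothing ... */

	return stashed != p->deadline;
}
```

For a task initialised with deadline 14815760217, replenish_delta
1199830 and exhausted set, toy_change(&t, false) returns true (a
spurious priority change), while toy_change(&t, true) on a fresh copy
returns false.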

Fixes: 6455ad5346c9c ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Since I'm not too familiar with the deadline bits, I've marked this as
RFC for now. If you require any data from my setup, please do let me
know.

Patches are based on:

  git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit 6ab7973f2540 ("sched/fair: Fix sched_avg fold").

To run with amd-pstate on PREEMPT_RT, you'll first need the patches from
https://lore.kernel.org/lkml/20260106073608.278644-1-kprateek.nayak@amd.com/
Most of the testing was done on top of Rafael's tree (v6.19.0-rc4 based)
with the above series where the issue was first seen.
---
 kernel/sched/core.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e3cb55..ce05957e8055 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10791,20 +10791,19 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
 		.running = task_current_donor(rq, p),
 	};
 
-	if (!(flags & DEQUEUE_CLASS)) {
-		if (p->sched_class->get_prio)
-			ctx->prio = p->sched_class->get_prio(rq, p);
-		else
-			ctx->prio = p->prio;
-	}
-
 	if (ctx->queued)
 		dequeue_task(rq, p, flags);
 	if (ctx->running)
 		put_prev_task(rq, p);
 
-	if ((flags & DEQUEUE_CLASS) && p->sched_class->switched_from)
+	if (!(flags & DEQUEUE_CLASS)) {
+		if (p->sched_class->get_prio)
+			ctx->prio = p->sched_class->get_prio(rq, p);
+		else
+			ctx->prio = p->prio;
+	} else if (p->sched_class->switched_from) {
 		p->sched_class->switched_from(rq, p);
+	}
 
 	return ctx;
 }

base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
-- 
2.34.1

Re: [RFC PATCH] sched/core: Stash task priority after dequeue and put_prev_task() in sched_change_begin()

Posted by Peter Zijlstra 1 month ago

On Tue, Jan 06, 2026 at 07:52:39AM +0000, K Prateek Nayak wrote:
> When running amd-pstate driver on a PREEMPT_RT kernel on a shared memory
> system (Zen3 and prior), the following splat was observed from
> triggering the WARN_ON_ONCE() in rq_pin_lock():
> 
>     ------------[ cut here ]------------
>     WARNING: kernel/sched/sched.h:1807 at __schedule+0x122/0x17c0, CPU#8: swapper/0/1

Can you enable CONFIG_DEBUG_BUGVERBOSE_DETAILED?

(not critical this time, since you already said rq_pin_lock() and that
only has the one WARN; it does help in general because 'obviously' 1807
isn't actually in rq_pin_lock() for me).

>     Call Trace:
>      <TASK>
>      preempt_schedule+0x41/0x60
>      preempt_schedule_thunk+0x16/0x30
>      try_to_wake_up+0x341/0x7c0
>      autoremove_wake_function+0x12/0x40
>      __wake_up_common+0x78/0xa0
>      __wake_up+0x31/0x50
>      send_pcc_cmd+0x133/0x310
>      cppc_set_reg_val+0x10e/0x220

> 
> Inspecting the set of events that led to the warning being triggered
> showed the following:
> 
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
> 
>     systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
>     systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
> 
>     systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
>     systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
>     systemd-1  [008] dN.31 ...: sched_change_end: End!
> 
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
>     systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!
> 
> 1. sched_change_begin() from guard(sched_change) in
>    do_set_cpus_allowed() stashes the priority, which for the deadline
>    task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
>    no *_CLASS flags set, sched_change_end() calls
>    dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
>    differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
>    do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
>    warning since the balance pull callback was never executed before
>    dropping the lock.
> 
> Since the dequeue on a deadline task can push its deadline, stash the
> task prio towards the end of sched_change_begin().
> 
> The modification to priority within the sched_change guard's scope will
> still be considered as sched_change_end() will supply the priority
> stashed at the end of constructor's execution as the old priority to
> sched_class->prio_changed().
> 

Would not something like so make more sense?

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 80c9559a3e30..60e0c25aae78 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3306,6 +3306,8 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 
 static u64 get_prio_dl(struct rq *rq, struct task_struct *p)
 {
+	if (task_current_donor(rq, p))
+		update_curr_dl(rq);
 	return p->dl.deadline;
 }

Re: [RFC PATCH] sched/core: Stash task priority after dequeue and put_prev_task() in sched_change_begin()

Posted by K Prateek Nayak 1 month ago

Hello Peter,

On 1/6/2026 4:11 PM, Peter Zijlstra wrote:
> On Tue, Jan 06, 2026 at 07:52:39AM +0000, K Prateek Nayak wrote:
>> When running amd-pstate driver on a PREEMPT_RT kernel on a shared memory
>> system (Zen3 and prior), the following splat was observed from
>> triggering the WARN_ON_ONCE() in rq_pin_lock():
>>
>>     ------------[ cut here ]------------
>>     WARNING: kernel/sched/sched.h:1807 at __schedule+0x122/0x17c0, CPU#8: swapper/0/1
> 
> Can you enable CONFIG_DEBUG_BUGVERBOSE_DETAILED?

Ack! I didn't know this was the config that printed the condition for
the warn on. I'll make sure to enable it henceforth.

[..snip..]

>> Since the dequeue on a deadline task can push its deadline, stash the
>> task prio towards the end of sched_change_begin().
>>
>> The modification to priority within the sched_change guard's scope will
>> still be considered as sched_change_end() will supply the priority
>> stashed at the end of constructor's execution as the old priority to
>> sched_class->prio_changed().
>>
> 
> Would not something like so make more sense?
> 
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 80c9559a3e30..60e0c25aae78 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -3306,6 +3306,8 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  
>  static u64 get_prio_dl(struct rq *rq, struct task_struct *p)
>  {
> +	if (task_current_donor(rq, p))
> +		update_curr_dl(rq);
>  	return p->dl.deadline;
>  }
>  

Yup this makes sense too! Feel free to add a small note like:

Catch up the deadline before returning it from get_prio() for the
current donor. This ensures the sched_change guard caches the latest
value and doesn't mistake the subsequent dequeue for a change in the
task's priority.

and include:

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek

[tip: sched/urgent] sched/deadline: Ensure get_prio_dl() is up-to-date

Posted by tip-bot2 for Peter Zijlstra 3 weeks, 2 days ago

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     375410bb9a403009a44af3cc7f087090da076e09
Gitweb:        https://git.kernel.org/tip/375410bb9a403009a44af3cc7f087090da076e09
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 06 Jan 2026 11:41:13 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 15 Jan 2026 21:57:52 +01:00

sched/deadline: Ensure get_prio_dl() is up-to-date

Prateek tripped a WARN and noted the following issue:

> Inspecting the set of events that led to the warning being triggered
> showed the following:
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
>
>     systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
>     systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
>
>     systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
>     systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
>     systemd-1  [008] dN.31 ...: sched_change_end: End!
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
>     systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!
>
> 1. sched_change_begin() from guard(sched_change) in
>    do_set_cpus_allowed() stashes the priority, which for the deadline
>    task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
>    no *_CLASS flags set, sched_change_end() calls
>    dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
>    differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
>    do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
>    warning since the balance pull callback was never executed before
>    dropping the lock.

Meaning get_prio_dl() ought to update current and return an up-to-date
value.

Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
---
 kernel/sched/deadline.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b5c19b1..b7acf74 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3296,6 +3296,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 
 static u64 get_prio_dl(struct rq *rq, struct task_struct *p)
 {
+	/*
+	 * Make sure to update current so we don't return a stale value.
+	 */
+	if (task_current_donor(rq, p))
+		update_curr_dl(rq);
+
 	return p->dl.deadline;
 }
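
The effect of catching up in get_prio() can be sketched in a userspace
model. Again, this is an illustration under assumptions rather than the
kernel implementation: "struct toy_task", toy_update_curr(),
toy_get_prio() and toy_change() are hypothetical names, and the model
assumes the accounting step is idempotent, so that running it once in
get_prio() leaves nothing for the later dequeue to move.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a deadline task; not the kernel's task_struct. */
struct toy_task {
	uint64_t deadline;        /* plays the role of p->dl.deadline */
	uint64_t replenish_delta; /* how far a replenish pushes the deadline */
	bool     exhausted;       /* pending accounting will replenish */
	bool     is_current;      /* models task_current_donor(rq, p) */
};

/* Stand-in for the accounting update: apply any pending replenish once. */
static void toy_update_curr(struct toy_task *p)
{
	if (p->exhausted) {
		p->deadline += p->replenish_delta;
		p->exhausted = false;
	}
}

/* With catch_up set, bring the deadline up to date before reading it. */
static uint64_t toy_get_prio(struct toy_task *p, bool catch_up)
{
	if (catch_up && p->is_current)
		toy_update_curr(p);
	return p->deadline;
}

/*
 * Stash the priority, dequeue (assumed to run the same accounting),
 * then report whether the stashed and current values differ.
 */
static bool toy_change(struct toy_task *p, bool catch_up)
{
	uint64_t stashed = toy_get_prio(p, catch_up);

	toy_update_curr(p);	/* accounting performed on dequeue */

	return stashed != p->deadline;
}
```

With a current task whose accounting is pending, toy_change(&t, false)
reports a change even though nothing in the guarded scope touched the
priority, whereas toy_change(&t, true) on a fresh copy reports none,
because the stash already holds the replenished deadline.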