[PATCH v2] sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting

Juri Lelli posted 1 patch 1 month, 1 week ago
There is a newer version of this series
kernel/sched/syscalls.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
[PATCH v2] sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
Posted by Juri Lelli 1 month, 1 week ago
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine
might lead to the following WARNINGs (edited).

 sched: DL de-boosted task PID 22725: REPLENISH flag missing

 WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
 ... (running_bw underflow)
 Call trace:
  dequeue_task_dl+0x15c/0x1f8 (P)
  dequeue_task+0x80/0x168
  deactivate_task+0x24/0x50
  push_dl_task+0x264/0x2e0
  dl_task_timer+0x1b0/0x228
  __hrtimer_run_queues+0x188/0x378
  hrtimer_interrupt+0xfc/0x260
  arch_timer_handler_phys+0x34/0x60
  handle_percpu_devid_irq+0xa4/0x230
  generic_handle_domain_irq+0x34/0x60
  __gic_handle_irq_from_irqson.isra.0+0x158/0x298
  gic_handle_irq+0x28/0x80
  call_on_irq_stack+0x30/0x48
  do_interrupt_handler+0xdc/0xe8
  el1_interrupt+0x44/0xc0
  el1h_64_irq_handler+0x18/0x28
  el1h_64_irq+0x80/0x88
  cpuidle_enter_state+0xc4/0x520 (P)
  cpuidle_enter+0x40/0x60
  cpuidle_idle_call+0x13c/0x220
  do_idle+0xa4/0x120
  cpu_startup_entry+0x40/0x50
  secondary_start_kernel+0xe4/0x128
  __secondary_switched+0xc0/0xc8

The problem is that when a SCHED_DEADLINE task (lock holder) is
changed to a lower priority class via sched_setscheduler(), it may
fail to properly inherit the parameters of potential DEADLINE donors
if it didn't already inherit them in the past (shorter deadline than
donor's at that time). This might lead to bandwidth accounting
corruption, as enqueue_task_dl() won't recognize the lock holder as
boosted.

The scenario occurs when:
1. A DEADLINE task (donor) blocks on a PI mutex held by another
   DEADLINE task (holder), but the holder doesn't inherit parameters
   (e.g., it already has a shorter deadline)
2. sched_setscheduler() changes the holder from DEADLINE to a lower
   class while still holding the mutex
3. The holder should now inherit DEADLINE parameters from the donor
   and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen

Fix the issue by introducing __setscheduler_dl_pi(), which detects when
a DEADLINE (proper or boosted) task gets setscheduled to a lower
priority class. In that case, the function makes the task inherit the
DEADLINE parameters of the donor (pi_se) and sets the ENQUEUE_REPLENISH
flag to ensure proper bandwidth accounting during the next enqueue
operation.
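
As a rough user-space sketch of when the helper is expected to act: the
effective priority is still in the DEADLINE range (because of boosting)
while the requested policy no longer is, and a top PI waiter exists. All
toy_* names are hypothetical, and TOY_ENQUEUE_REPLENISH is a placeholder
bit rather than the kernel's flag value.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SCHED_NORMAL      0
#define TOY_SCHED_DEADLINE    6     /* same value as the uapi SCHED_DEADLINE */
#define TOY_MAX_DL_PRIO       0     /* DEADLINE tasks have a prio below this */
#define TOY_ENQUEUE_REPLENISH 0x1u  /* placeholder bit, not the kernel value */

static bool toy_dl_prio(int prio)
{
	return prio < TOY_MAX_DL_PRIO;
}

static bool toy_dl_policy(int policy)
{
	return policy == TOY_SCHED_DEADLINE;
}

/* Returns the enqueue flags, with the replenish bit added when needed. */
static unsigned int toy_setscheduler_dl_pi(int newprio, int policy,
					   bool has_pi_task, unsigned int flags)
{
	if (toy_dl_prio(newprio) && !toy_dl_policy(policy) && has_pi_task)
		flags |= TOY_ENQUEUE_REPLENISH;
	return flags;
}

/* Walks through the four interesting input combinations. */
static void toy_check_cases(void)
{
	/* Boosted by a DEADLINE donor, own policy lowered: replenish. */
	assert(toy_setscheduler_dl_pi(-1, TOY_SCHED_NORMAL, true, 0) ==
	       TOY_ENQUEUE_REPLENISH);
	/* Still a proper DEADLINE task: nothing to do. */
	assert(toy_setscheduler_dl_pi(-1, TOY_SCHED_DEADLINE, true, 0) == 0);
	/* Effective priority outside the DEADLINE range: nothing to do. */
	assert(toy_setscheduler_dl_pi(120, TOY_SCHED_NORMAL, true, 0) == 0);
	/* No top PI waiter: nothing to inherit from. */
	assert(toy_setscheduler_dl_pi(-1, TOY_SCHED_NORMAL, false, 0) == 0);
}
```

Only the first combination sets the bit; the other three return the
flags unchanged.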

Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
Hello,

v2 of the fix for the issue described in the changelog.

The issue was discovered by Bruno Goncalves while running stress-ng
--schedpolicy 0 on RT kernels on large systems (I believe lots of CPUs
and PI-enabled in-kernel mutexes make it easier to trigger). Later on, a
simpler and more focused reproducer was created (with Claude Code help)
and is available at

https://github.com/jlelli/sched-deadline-tests/blob/master/test_dl_replenish_bug.c

Fix also available from

git@github.com:jlelli/linux.git fix-deadline-piboost-v2
---
Changes in v2:
- Rebased to tip/sched/core as of today
- Fix things inside !KEEP_PARAMS (Peter)
- Create a different helper function
- Link to v1: https://patch.msgid.link/20260206-upstream-fix-deadline-piboost-b4-v1-1-14043567b89c@redhat.com
---
 kernel/sched/syscalls.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index a288ac0a633d7..b215b0ead9a60 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -284,6 +284,35 @@ static bool check_same_owner(struct task_struct *p)
 		uid_eq(cred->euid, pcred->uid));
 }
 
+#ifdef CONFIG_RT_MUTEXES
+static inline void __setscheduler_dl_pi(int newprio, int policy,
+			      struct task_struct *p,
+			      struct sched_change_ctx *scope)
+{
+	/*
+	 * In case a DEADLINE task (either proper or boosted) gets
+	 * setscheduled to a lower priority class, check if it needs to
+	 * inherit parameters from a potential pi_task. In that case make
+	 * sure replenishment happens with the next enqueue.
+	 */
+
+	if (dl_prio(newprio) && !dl_policy(policy)) {
+		struct task_struct *pi_task = rt_mutex_get_top_task(p);
+
+		if (pi_task) {
+			p->dl.pi_se = pi_task->dl.pi_se;
+			scope->flags |= ENQUEUE_REPLENISH;
+		}
+	}
+}
+#else /* !CONFIG_RT_MUTEXES */
+static inline void __setscheduler_dl_pi(int newprio, int policy,
+			      struct task_struct *p,
+			      struct sched_change_ctx *scope)
+{
+}
+#endif /* !CONFIG_RT_MUTEXES */
+
 #ifdef CONFIG_UCLAMP_TASK
 
 static int uclamp_validate(struct task_struct *p,
@@ -655,6 +684,7 @@ int __sched_setscheduler(struct task_struct *p,
 			__setscheduler_params(p, attr);
 			p->sched_class = next_class;
 			p->prio = newprio;
+			__setscheduler_dl_pi(newprio, policy, p, scope);
 		}
 		__setscheduler_uclamp(p, attr);
 

---
base-commit: 2e7af192697ef2a71c76fd57860b0fcd02754e14
change-id: 20260205-upstream-fix-deadline-piboost-b4-2d924be17182

Best regards,
--  
Juri Lelli <juri.lelli@redhat.com>
Re: [PATCH v2] sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
Posted by Peter Zijlstra 1 month, 1 week ago
On Mon, Mar 02, 2026 at 11:01:00AM +0100, Juri Lelli wrote:
> Running stress-ng --schedpolicy 0 on an RT kernel on a big machine
> might lead to the following WARNINGs (edited).
> 
>  sched: DL de-boosted task PID 22725: REPLENISH flag missing
> 
>  WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
>  ... (running_bw underflow)
>  Call trace:
>   dequeue_task_dl+0x15c/0x1f8 (P)
>   dequeue_task+0x80/0x168
>   deactivate_task+0x24/0x50
>   push_dl_task+0x264/0x2e0
>   dl_task_timer+0x1b0/0x228
>   __hrtimer_run_queues+0x188/0x378
>   hrtimer_interrupt+0xfc/0x260
>   arch_timer_handler_phys+0x34/0x60
>   handle_percpu_devid_irq+0xa4/0x230
>   generic_handle_domain_irq+0x34/0x60
>   __gic_handle_irq_from_irqson.isra.0+0x158/0x298
>   gic_handle_irq+0x28/0x80
>   call_on_irq_stack+0x30/0x48
>   do_interrupt_handler+0xdc/0xe8
>   el1_interrupt+0x44/0xc0
>   el1h_64_irq_handler+0x18/0x28
>   el1h_64_irq+0x80/0x88
>   cpuidle_enter_state+0xc4/0x520 (P)
>   cpuidle_enter+0x40/0x60
>   cpuidle_idle_call+0x13c/0x220
>   do_idle+0xa4/0x120
>   cpu_startup_entry+0x40/0x50
>   secondary_start_kernel+0xe4/0x128
>   __secondary_switched+0xc0/0xc8
> 
> The problem is that when a SCHED_DEADLINE task (lock holder) is
> changed to a lower priority class via sched_setscheduler(), it may
> fail to properly inherit the parameters of potential DEADLINE donors
> if it didn't already inherit them in the past (shorter deadline than
> donor's at that time). This might lead to bandwidth accounting
> corruption, as enqueue_task_dl() won't recognize the lock holder as
> boosted.
> 
> The scenario occurs when:
> 1. A DEADLINE task (donor) blocks on a PI mutex held by another
>    DEADLINE task (holder), but the holder doesn't inherit parameters
>    (e.g., it already has a shorter deadline)
> 2. sched_setscheduler() changes the holder from DEADLINE to a lower
>    class while still holding the mutex
> 3. The holder should now inherit DEADLINE parameters from the donor
>    and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen
> 
> Fix the issue by introducing __setscheduler_dl_pi(), which detects when
> a DEADLINE (proper or boosted) task gets setscheduled to a lower
> priority class. In that case, the function makes the task inherit DEADLINE
> parameters of the donor (pi_se) and sets ENQUEUE_REPLENISH flag to
> ensure proper bandwidth accounting during the next enqueue operation.
> 
> Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>

Does this thing want a Fixes?

Also, perhaps trim the WARN to the bare minimum required?
Re: [PATCH v2] sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
Posted by Juri Lelli 1 month, 1 week ago
On 02/03/26 15:48, Peter Zijlstra wrote:
> On Mon, Mar 02, 2026 at 11:01:00AM +0100, Juri Lelli wrote:

...

> 
> Does this thing want a Fixes?
> 
> Also, perhaps trim the WARN to the bare minimum required?
> 

Just sent out v3 addressing both points.

Thanks,
Juri