[v2] Fair scheduling deadline server fixes

[PATCH v2 11/15] sched/deadline: Mark DL server as unthrottled before enqueue

Posted by Joel Fernandes (Google) 1 year, 11 months ago

start_dl_timer() returns 0 when it does not arm the timer (say, because the
zero-laxity time has already passed), so the DL server can reach the enqueue
while still marked throttled. In such cases, mark the DL entity which is
about to be enqueued as not throttled, cancel any previously armed timer,
and then do the enqueue.

This fixes the following crash:

[    9.263331] kernel BUG at kernel/sched/deadline.c:1765!
[    9.282382] Call Trace:
[    9.282767]  <TASK>
[    9.283086]  ? __die_body+0x62/0xb0
[    9.283602]  ? die+0x9b/0xc0
[    9.284036]  ? do_trap+0xa3/0x170
[    9.284528]  ? enqueue_dl_entity+0x45e/0x460
[    9.285158]  ? enqueue_dl_entity+0x45e/0x460
[    9.285791]  ? handle_invalid_op+0x65/0x80
[    9.286392]  ? enqueue_dl_entity+0x45e/0x460
[    9.287021]  ? exc_invalid_op+0x2f/0x40
[    9.287585]  ? asm_exc_invalid_op+0x16/0x20
[    9.288200]  ? find_later_rq+0x120/0x120
[    9.288775]  ? fair_server_init+0x40/0x40
[    9.289364]  ? enqueue_dl_entity+0x45e/0x460
[    9.289989]  ? find_later_rq+0x120/0x120
[    9.290564]  dl_task_timer+0x1d7/0x2f0
[    9.291120]  ? find_later_rq+0x120/0x120
[    9.291695]  __run_hrtimer+0x73/0x1b0
[    9.292238]  hrtimer_interrupt+0x216/0x2c0
[    9.292841]  __sysvec_apic_timer_interrupt+0x53/0x140
[    9.293581]  sysvec_apic_timer_interrupt+0x2d/0x80
[    9.294285]  asm_sysvec_apic_timer_interrupt+0x16/0x20

The crash can easily be reproduced by adding a 100ms delay as follows:

+int delay_inject_count;
+
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
@@ -1827,6 +1830,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
                setup_new_dl_entity(dl_se);
        }

+       // 100ms delay every 20 enqueues.
+       if (delay_inject_count++ > 20) {
+               mdelay(100);
+               delay_inject_count = 0;
+       }
+
        /*
         * If we are still throttled, eg. we got replenished but are a
         * zero-laxity task and still got to wait, don't enqueue.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/deadline.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5adfc15803c3..1d54231fbaa6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1949,6 +1949,18 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	if (dl_se->dl_throttled && start_dl_timer(dl_se))
 		return;
 
+	/*
+	 * We're about to enqueue, make sure we're not ->dl_throttled!
+	 * In case the timer was not started, say because the 0-lax time
+	 * has passed, mark as not throttled and mark unarmed.
+	 * Also cancel earlier timers, since letting those run is pointless.
+	 */
+	if (dl_se->dl_throttled) {
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+		dl_se->dl_defer_armed = 0;
+		dl_se->dl_throttled = 0;
+	}
+
 	__enqueue_dl_entity(dl_se);
 }
 
-- 
2.34.1
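
For readers following along outside a kernel tree, the control flow of the
hunk above can be approximated in user space. This is only a sketch under
stated assumptions: struct model_dl_entity, model_start_dl_timer() and
model_enqueue() are hypothetical stand-ins invented for illustration, the
timer is reduced to a boolean, and time is a plain integer. It is not the
kernel implementation and does not reproduce the BUG itself; it only shows
how an entity can end up queued while still marked throttled when the timer
is not armed, and how the added block clears that state first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct sched_dl_entity. */
struct model_dl_entity {
	bool dl_throttled;
	bool dl_defer_armed;
	bool timer_queued;	/* stands in for an armed dl_timer */
	bool on_rq;		/* set once the entity has been queued */
	uint64_t replenish_at;	/* absolute time the timer would fire */
};

/*
 * Model of the start_dl_timer() contract described in the changelog:
 * returns 1 if the timer was armed, 0 if the firing time already passed.
 */
static int model_start_dl_timer(struct model_dl_entity *se, uint64_t now)
{
	if (se->replenish_at < now)
		return 0;
	se->timer_queued = true;
	return 1;
}

/*
 * Model of the tail of enqueue_dl_entity(). Returns true if the entity
 * was queued, false if it was left waiting for its timer. with_fix
 * selects whether the block added by this patch is applied.
 */
static bool model_enqueue(struct model_dl_entity *se, uint64_t now,
			  bool with_fix)
{
	if (se->dl_throttled && model_start_dl_timer(se, now))
		return false;

	if (with_fix && se->dl_throttled) {
		se->timer_queued = false;	/* models hrtimer_try_to_cancel() */
		se->dl_defer_armed = false;
		se->dl_throttled = false;
	}

	se->on_rq = true;	/* models __enqueue_dl_entity() */
	return true;
}
```

Calling model_enqueue() with a replenish_at that lies before now and
with_fix set to false leaves the entity with both on_rq and dl_throttled
set; with with_fix set to true the throttled and armed flags are cleared
before the entity is queued.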

Re: [PATCH v2 11/15] sched/deadline: Mark DL server as unthrottled before enqueue

Posted by Daniel Bristot de Oliveira 1 year, 10 months ago

On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> start_dl_timer() returns 0 when it does not arm the timer (say, because the
> zero-laxity time has already passed), so the DL server can reach the enqueue
> while still marked throttled. In such cases, mark the DL entity which is
> about to be enqueued as not throttled, cancel any previously armed timer,
> and then do the enqueue.
> 
> This fixes the following crash:
> 
> [    9.263331] kernel BUG at kernel/sched/deadline.c:1765!
> [    9.282382] Call Trace:
> [    9.282767]  <TASK>
> [    9.283086]  ? __die_body+0x62/0xb0
> [    9.283602]  ? die+0x9b/0xc0
> [    9.284036]  ? do_trap+0xa3/0x170
> [    9.284528]  ? enqueue_dl_entity+0x45e/0x460
> [    9.285158]  ? enqueue_dl_entity+0x45e/0x460
> [    9.285791]  ? handle_invalid_op+0x65/0x80
> [    9.286392]  ? enqueue_dl_entity+0x45e/0x460
> [    9.287021]  ? exc_invalid_op+0x2f/0x40
> [    9.287585]  ? asm_exc_invalid_op+0x16/0x20
> [    9.288200]  ? find_later_rq+0x120/0x120
> [    9.288775]  ? fair_server_init+0x40/0x40
> [    9.289364]  ? enqueue_dl_entity+0x45e/0x460
> [    9.289989]  ? find_later_rq+0x120/0x120
> [    9.290564]  dl_task_timer+0x1d7/0x2f0
> [    9.291120]  ? find_later_rq+0x120/0x120
> [    9.291695]  __run_hrtimer+0x73/0x1b0
> [    9.292238]  hrtimer_interrupt+0x216/0x2c0
> [    9.292841]  __sysvec_apic_timer_interrupt+0x53/0x140
> [    9.293581]  sysvec_apic_timer_interrupt+0x2d/0x80
> [    9.294285]  asm_sysvec_apic_timer_interrupt+0x16/0x20
> 
> The crash can easily be reproduced by adding a 100ms delay as follows:
> 
> +int delay_inject_count;
> +
>  static void
>  enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>  {
> @@ -1827,6 +1830,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>                 setup_new_dl_entity(dl_se);
>         }
> 
> +       // 100ms delay every 20 enqueues.
> +       if (delay_inject_count++ > 20) {
> +               mdelay(100);
> +               delay_inject_count = 0;
> +       }
> +
>         /*
>          * If we are still throttled, eg. we got replenished but are a
>          * zero-laxity task and still got to wait, don't enqueue.


Makes sense. I am adding this to v6 of the defer patch, since it is a fix for it.

-- Daniel