[PATCH] sched/rt: Fix race in push_rt_task
Posted by Harshit Agarwal 10 months, 1 week ago
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue, it calls the find_lock_lowest_rq method, which
takes a double lock on both CPUs' runqueues. If one of the locks isn't
readily available, this may lead to dropping the current runqueue lock
and reacquiring both locks at once. During this window it is possible
that the task has already been migrated and is running on some other
CPU. These cases are already handled. However, if the task has been
migrated, has finished executing, and another CPU is now trying to
wake it up (ttwu) such that it is queued again on a runqueue
(on_rq is 1), and the task was last run by the same CPU, then the
current checks will pass even though the task was migrated out and is
no longer in the pushable tasks list.
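
For reference, the double-lock dance described above looks roughly like
this. It is a simplified sketch of the logic in _double_lock_balance()
(see kernel/sched/sched.h) with the lock-ordering fast path omitted; a
reading aid, not the exact kernel code:

	/*
	 * Sketch: lock busiest's rq while already holding this_rq's lock.
	 * Returns 1 if this_rq's lock was dropped and retaken, meaning the
	 * caller's view of the world may be stale and must be revalidated.
	 */
	static int double_lock_balance_sketch(struct rq *this_rq, struct rq *busiest)
	{
		if (raw_spin_rq_trylock(busiest))
			return 0;	/* got both locks, no window opened */

		/*
		 * Lock ordering requires taking the lower-ordered lock
		 * first, so drop this_rq's lock and take both in order.
		 * This is the window during which the task can be
		 * migrated, run, block, and be woken up on another CPU.
		 */
		raw_spin_rq_unlock(this_rq);
		double_rq_lock(this_rq, busiest);

		return 1;
	}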

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics, with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors, all
coming from the scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) The previous task has completed execution, and thus CPU Z enters
     schedule and locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (e.g. by doing IO or waiting to acquire a
     lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     task p.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A fails to notice that task p was dequeued from CPU Z while it
     was waiting for the locks in double_lock_balance. Had CPU A noticed
     the dequeue, find_lock_lowest_rq would have returned NULL, forcing
     push_rt_task to give up on task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Y’s rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y’s rq, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to CPU A dequeuing p from CPU Y's runqueue and to various
crashes down the line, as the annotated check below illustrates.
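
For reference, here is the pre-fix re-check from find_lock_lowest_rq(),
annotated with the values CPU A observes at step 4 above (annotations
are a reading aid, not part of the kernel source):

	if (unlikely(task_rq(task) != rq ||	/* false: p.cpu still reads Z (4b) */
		     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
						/* false: X is in p's affinity mask */
		     task_on_cpu(rq, task) ||	/* false: p is not running on Z */
		     !rt_task(task) ||		/* false: p is still an RT task */
		     is_migration_disabled(task) ||
						/* false: migration is not disabled */
		     !task_on_rq_queued(task))) {
						/* false: p.on_rq = 1 again (4e/4g) */

Every condition evaluates false, so CPU A concludes that task p never
left and pushes a task that is actually queued on CPU Y.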

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If it is not, then it is in any case not
suitable for being pushed out.
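
Why the head-of-list check is authoritative: the pushable tasks list is
only manipulated under the rq lock, and dequeuing a task from the rq
also removes it from that list. As a reading aid, here is an abridged
version of dequeue_task_rt() from kernel/sched/rt.c (exact code varies
by kernel version):

	static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
	{
		struct sched_rt_entity *rt_se = &p->rt;

		update_curr_rt(rq);
		dequeue_rt_entity(rt_se, flags);

		/* The task leaves the pushable tasks list here. */
		dequeue_pushable_task(rq, p);

		return true;
	}

So once CPU Z dequeues task p at step 2e, task p can no longer be at
the head of CPU Z's pushable tasks list, and the enhanced check catches
the race.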

Testing
=======
The fix was tested on a cluster of 3 nodes, where panics due to this
issue were hit every couple of days. A logically equivalent fix was
deployed on such a cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Tested-by: Will Ton <william.ton@nutanix.com>
---
 kernel/sched/rt.c | 49 ++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..d48a9cb9ac92 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
 	return -1;
 }
 
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->rt.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_donor(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!task_on_rq_queued(p));
+	BUG_ON(!rt_task(p));
+
+	return p;
+}
+
 /* Will lock the rq it finds */
 static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 {
@@ -1920,13 +1941,18 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 			 * It is possible the task was scheduled, set
 			 * "migrate_disabled" and then got preempted, so we must
 			 * check the task migration disable flag here too.
+			 * Also, the task may have been dequeued and completed
+			 * execution on the same CPU during this time, therefore
+			 * check if the task is still at the head of the
+			 * pushable tasks list.
 			 */
 			if (unlikely(task_rq(task) != rq ||
 				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
 				     task_on_cpu(rq, task) ||
 				     !rt_task(task) ||
 				     is_migration_disabled(task) ||
-				     !task_on_rq_queued(task))) {
+				     !task_on_rq_queued(task) ||
+				     task != pick_next_pushable_task(rq))) {
 
 				double_unlock_balance(rq, lowest_rq);
 				lowest_rq = NULL;
@@ -1946,27 +1972,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	return lowest_rq;
 }
 
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
-{
-	struct task_struct *p;
-
-	if (!has_pushable_tasks(rq))
-		return NULL;
-
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
-
-	BUG_ON(rq->cpu != task_cpu(p));
-	BUG_ON(task_current(rq, p));
-	BUG_ON(task_current_donor(rq, p));
-	BUG_ON(p->nr_cpus_allowed <= 1);
-
-	BUG_ON(!task_on_rq_queued(p));
-	BUG_ON(!rt_task(p));
-
-	return p;
-}
-
 /*
  * If the current CPU has more than one RT task, see if the non
  * running task can migrate over to a CPU that is running a task
-- 
2.22.3

Re: [PATCH] sched/rt: Fix race in push_rt_task
Posted by Steven Rostedt 10 months, 1 week ago
On Tue, 11 Feb 2025 05:46:45 +0000
Harshit Agarwal <harshit@nutanix.com> wrote:

> Overview
> ========
> When a CPU chooses to call push_rt_task and picks a task to push to
> another CPU's runqueue, it calls the find_lock_lowest_rq method, which
> takes a double lock on both CPUs' runqueues. If one of the locks isn't
> readily available, this may lead to dropping the current runqueue lock
> and reacquiring both locks at once. During this window it is possible
> that the task has already been migrated and is running on some other
> CPU. These cases are already handled. However, if the task has been
> migrated, has finished executing, and another CPU is now trying to
> wake it up (ttwu) such that it is queued again on a runqueue
> (on_rq is 1), and the task was last run by the same CPU, then the
> current checks will pass even though the task was migrated out and is
> no longer in the pushable tasks list.

Nice catch! And nice analysis.

> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4b8e33c615b1..d48a9cb9ac92 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
>  	return -1;
>  }
>  
> +static struct task_struct *pick_next_pushable_task(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +	if (!has_pushable_tasks(rq))
> +		return NULL;
> +
> +	p = plist_first_entry(&rq->rt.pushable_tasks,
> +			      struct task_struct, pushable_tasks);
> +
> +	BUG_ON(rq->cpu != task_cpu(p));
> +	BUG_ON(task_current(rq, p));
> +	BUG_ON(task_current_donor(rq, p));
> +	BUG_ON(p->nr_cpus_allowed <= 1);
> +
> +	BUG_ON(!task_on_rq_queued(p));
> +	BUG_ON(!rt_task(p));
> +
> +	return p;
> +}
> +
>  /* Will lock the rq it finds */
>  static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  {
> @@ -1920,13 +1941,18 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  			 * It is possible the task was scheduled, set
>  			 * "migrate_disabled" and then got preempted, so we must
>  			 * check the task migration disable flag here too.
> +			 * Also, the task may have been dequeued and completed
> +			 * execution on the same CPU during this time, therefore
> +			 * check if the task is still at the head of the
> +			 * pushable tasks list.
>  			 */
>  			if (unlikely(task_rq(task) != rq ||
>  				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
>  				     task_on_cpu(rq, task) ||
>  				     !rt_task(task) ||
>  				     is_migration_disabled(task) ||
> -				     !task_on_rq_queued(task))) {
> +				     !task_on_rq_queued(task) ||
> +				     task != pick_next_pushable_task(rq))) {

I'm wondering: could we replace all of these checks with just:

				task != pick_next_pushable_task(rq)
?

I mean, what check above would fail and not cause that one to fail?

Maybe the is_migration_disabled() check, but I'd like to remove any of
those checks that would be covered by pick_next_pushable_task().

-- Steve



>  
>  				double_unlock_balance(rq, lowest_rq);
>  				lowest_rq = NULL;
> @@ -1946,27 +1972,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  	return lowest_rq;
>  }
>  
> -static struct task_struct *pick_next_pushable_task(struct rq *rq)
> -{
> -	struct task_struct *p;
> -
> -	if (!has_pushable_tasks(rq))
> -		return NULL;
> -
> -	p = plist_first_entry(&rq->rt.pushable_tasks,
> -			      struct task_struct, pushable_tasks);
> -
> -	BUG_ON(rq->cpu != task_cpu(p));
> -	BUG_ON(task_current(rq, p));
> -	BUG_ON(task_current_donor(rq, p));
> -	BUG_ON(p->nr_cpus_allowed <= 1);
> -
> -	BUG_ON(!task_on_rq_queued(p));
> -	BUG_ON(!rt_task(p));
> -
> -	return p;
> -}
> -
>  /*
>   * If the current CPU has more than one RT task, see if the non
>   * running task can migrate over to a CPU that is running a task
Re: [PATCH] sched/rt: Fix race in push_rt_task
Posted by Harshit Agarwal 10 months, 1 week ago

> On Feb 11, 2025, at 7:11 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 11 Feb 2025 05:46:45 +0000
> Harshit Agarwal <harshit@nutanix.com> wrote:
> 
>> Overview
>> ========
>> When a CPU chooses to call push_rt_task and picks a task to push to
>> another CPU's runqueue, it calls the find_lock_lowest_rq method, which
>> takes a double lock on both CPUs' runqueues. If one of the locks isn't
>> readily available, this may lead to dropping the current runqueue lock
>> and reacquiring both locks at once. During this window it is possible
>> that the task has already been migrated and is running on some other
>> CPU. These cases are already handled. However, if the task has been
>> migrated, has finished executing, and another CPU is now trying to
>> wake it up (ttwu) such that it is queued again on a runqueue
>> (on_rq is 1), and the task was last run by the same CPU, then the
>> current checks will pass even though the task was migrated out and is
>> no longer in the pushable tasks list.
> 
> Nice catch! And nice analysis.
> 
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 4b8e33c615b1..d48a9cb9ac92 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
>> return -1;
>> }
>> 
>> +static struct task_struct *pick_next_pushable_task(struct rq *rq)
>> +{
>> + struct task_struct *p;
>> +
>> + if (!has_pushable_tasks(rq))
>> + return NULL;
>> +
>> + p = plist_first_entry(&rq->rt.pushable_tasks,
>> +       struct task_struct, pushable_tasks);
>> +
>> + BUG_ON(rq->cpu != task_cpu(p));
>> + BUG_ON(task_current(rq, p));
>> + BUG_ON(task_current_donor(rq, p));
>> + BUG_ON(p->nr_cpus_allowed <= 1);
>> +
>> + BUG_ON(!task_on_rq_queued(p));
>> + BUG_ON(!rt_task(p));
>> +
>> + return p;
>> +}
>> +
>> /* Will lock the rq it finds */
>> static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>> {
>> @@ -1920,13 +1941,18 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>>  * It is possible the task was scheduled, set
>>  * "migrate_disabled" and then got preempted, so we must
>>  * check the task migration disable flag here too.
>> +  * Also, the task may have been dequeued and completed
>> +  * execution on the same CPU during this time, therefore
>> +  * check if the task is still at the head of the
>> +  * pushable tasks list.
>>  */
>> if (unlikely(task_rq(task) != rq ||
>>      !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
>>      task_on_cpu(rq, task) ||
>>      !rt_task(task) ||
>>      is_migration_disabled(task) ||
>> -      !task_on_rq_queued(task))) {
>> +      !task_on_rq_queued(task) ||
>> +      task != pick_next_pushable_task(rq))) {
> 
> I'm wondering: could we replace all of these checks with just:
> 
> task != pick_next_pushable_task(rq)
> ?
> 
> I mean, what check above would fail and not cause that one to fail?
> 
> Maybe the is_migration_disabled() check, but I'd like to remove any of
> those checks that would be covered by pick_next_pushable_task().
> 
> -- Steve
> 

Thanks Steve for taking a look. Yes, ideally we should remove any conditions that are
subsumed by the task != pick_next_pushable_task(rq) condition. However, I am not sure whether
anyone (say ttwu()) might try to modify p’s state without first removing it from the pushable
tasks list. In such a case pick_next_pushable_task() would still pick p, the check would match,
and we would proceed with the migration while ttwu() is trying to wake p up. Most likely some
asserts, such as BUG_ON(!task_on_rq_queued(p)), would be hit in pick_next_pushable_task() as
soon as p is picked. If we can be sure that p’s state is only modified after p has been removed
from this CPU’s pushable tasks list, then it is safe to remove those checks; otherwise it is
not.

Please let me know what you think.

Regards,
Harshit


> 
> 
>> 
>> double_unlock_balance(rq, lowest_rq);
>> lowest_rq = NULL;
>> @@ -1946,27 +1972,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>> return lowest_rq;
>> }
>> 
>> -static struct task_struct *pick_next_pushable_task(struct rq *rq)
>> -{
>> - struct task_struct *p;
>> -
>> - if (!has_pushable_tasks(rq))
>> - return NULL;
>> -
>> - p = plist_first_entry(&rq->rt.pushable_tasks,
>> -       struct task_struct, pushable_tasks);
>> -
>> - BUG_ON(rq->cpu != task_cpu(p));
>> - BUG_ON(task_current(rq, p));
>> - BUG_ON(task_current_donor(rq, p));
>> - BUG_ON(p->nr_cpus_allowed <= 1);
>> -
>> - BUG_ON(!task_on_rq_queued(p));
>> - BUG_ON(!rt_task(p));
>> -
>> - return p;
>> -}
>> -
>> /*
>>  * If the current CPU has more than one RT task, see if the non
>>  * running task can migrate over to a CPU that is running a task


Re: [PATCH] sched/rt: Fix race in push_rt_task
Posted by Steven Rostedt 10 months, 1 week ago
On Tue, 11 Feb 2025 21:08:22 +0000
Harshit Agarwal <harshit@nutanix.com> wrote:

> Thanks Steve for taking a look. Yes, ideally we should remove any conditions that are
> subsumed by the task != pick_next_pushable_task(rq) condition. However, I am not sure whether
> anyone (say ttwu()) might try to modify p’s state without first removing it from the pushable
> tasks list. In such a case pick_next_pushable_task() would still pick p, the check would match,
> and we would proceed with the migration while ttwu() is trying to wake p up. Most likely some
> asserts, such as BUG_ON(!task_on_rq_queued(p)), would be hit in pick_next_pushable_task() as
> soon as p is picked. If we can be sure that p’s state is only modified after p has been removed
> from this CPU’s pushable tasks list, then it is safe to remove those checks; otherwise it is
> not.

Note that all tasks on the pushable tasks list are in the running state.
Nothing should be trying to wake a task up while it's on that list. That
is, if p is still on the pushable tasks list then everything should
still be fine, especially now that the rq locks are held again.

I think that is the only check we need. The is_migration_disabled()
check should probably be done earlier, as the only way it could be set
is if the current task preempted it. If the task had migrated, and
migrated back, it couldn't have that flag set on this current CPU,
otherwise it would not have migrated.

Here's the current checks again:

	if (unlikely(task_rq(task) != rq ||
		     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
		     task_on_cpu(rq, task) ||
		     !rt_task(task) ||
		     is_migration_disabled(task) ||
		     !task_on_rq_queued(task))) {


Let's look at pick_next_pushable_task():

	p = plist_first_entry(&rq->rt.pushable_tasks,
			      struct task_struct, pushable_tasks);

	BUG_ON(rq->cpu != task_cpu(p));
	BUG_ON(task_current(rq, p));
	BUG_ON(task_current_donor(rq, p));
	BUG_ON(p->nr_cpus_allowed <= 1);

	BUG_ON(!task_on_rq_queued(p));
	BUG_ON(!rt_task(p));

If task_rq(task) != rq is true, then BUG_ON(rq->cpu != task_cpu(p)) would trigger.

  !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask)

We still need that check, to make sure the CPU we picked is in the task's affinity.

If task_on_cpu(rq, task) is true, then BUG_ON(task_current(rq, p)) would trigger.

If !rt_task(task) is true then BUG_ON(!rt_task(p)) would trigger.

   is_migration_disabled(task)

Is still needed, but could possibly be moved up? (unrelated change though)

If !task_on_rq_queued(task) is true then BUG_ON(!task_on_rq_queued(p)) would trigger.

Thus, I think we can change that condition to just:

	if (is_migration_disabled(task) ||
	    !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
	    task != pick_next_pushable_task(rq)) {

I moved the is_migration_disabled() up as that's the quickest check. If
that's true, no need to test the other conditions.

-- Steve
Re: [PATCH] sched/rt: Fix race in push_rt_task
Posted by Harshit Agarwal 10 months, 1 week ago

> On Feb 11, 2025, at 1:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 11 Feb 2025 21:08:22 +0000
> Harshit Agarwal <harshit@nutanix.com> wrote:
> 
>> Thanks Steve for taking a look. Yes, ideally we should remove any conditions that are
>> subsumed by the task != pick_next_pushable_task(rq) condition. However, I am not sure whether
>> anyone (say ttwu()) might try to modify p’s state without first removing it from the pushable
>> tasks list. In such a case pick_next_pushable_task() would still pick p, the check would match,
>> and we would proceed with the migration while ttwu() is trying to wake p up. Most likely some
>> asserts, such as BUG_ON(!task_on_rq_queued(p)), would be hit in pick_next_pushable_task() as
>> soon as p is picked. If we can be sure that p’s state is only modified after p has been removed
>> from this CPU’s pushable tasks list, then it is safe to remove those checks; otherwise it is
>> not.
> 
> Note that all tasks on the pushable tasks list are in the running state.
> Nothing should be trying to wake a task up while it's on that list. That
> is, if p is still on the pushable tasks list then everything should
> still be fine, especially now that the rq locks are held again.
> 
> I think that is the only check we need. The is_migration_disabled()
> check should probably be done earlier, as the only way it could be set
> is if the current task preempted it. If the task had migrated, and
> migrated back, it couldn't have that flag set on this current CPU,
> otherwise it would not have migrated.
> 
> Here's the current checks again:
> 
> if (unlikely(task_rq(task) != rq ||
>     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
>     task_on_cpu(rq, task) ||
>     !rt_task(task) ||
>     is_migration_disabled(task) ||
>     !task_on_rq_queued(task))) {
> 
> 
> Let's look at pick_next_pushable_task():
> 
> p = plist_first_entry(&rq->rt.pushable_tasks,
>      struct task_struct, pushable_tasks);
> 
> BUG_ON(rq->cpu != task_cpu(p));
> BUG_ON(task_current(rq, p));
> BUG_ON(task_current_donor(rq, p));
> BUG_ON(p->nr_cpus_allowed <= 1);
> 
> BUG_ON(!task_on_rq_queued(p));
> BUG_ON(!rt_task(p));
> 
> If task_rq(task) != rq is true, then BUG_ON(rq->cpu != task_cpu(p)) would trigger.
> 
>  !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask)
> 
> We still need that check, to make sure the CPU we picked is in the task's affinity.
> 
> If task_on_cpu(rq, task) is true, then BUG_ON(task_current(rq, p)) would trigger.
> 
> If !rt_task(task) is true then BUG_ON(!rt_task(p)) would trigger.
> 
>   is_migration_disabled(task)
> 
> Is still needed, but could possibly be moved up? (unrelated change though)
> 
> If !task_on_rq_queued(task) is true then BUG_ON(!task_on_rq_queued(p)) would trigger.
> 
> Thus, I think we can change that condition to just:
> 
> if (is_migration_disabled(task) ||
>    !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
>    task != pick_next_pushable_task(rq)) {
> 
> I moved the is_migration_disabled() up as that's the quickest check. If
> that's true, no need to test the other conditions.
> 
> -- Steve

Makes sense and thanks for all the details. This simplifies it. 
I will make this change as per your suggestion and send the
updated patch after testing it.

Regards,
Harshit
[PATCH v2] sched/rt: Fix race in push_rt_task
Posted by Harshit Agarwal 10 months ago
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to
another CPU's runqueue, it calls the find_lock_lowest_rq method, which
takes a double lock on both CPUs' runqueues. If one of the locks isn't
readily available, this may lead to dropping the current runqueue lock
and reacquiring both locks at once. During this window it is possible
that the task has already been migrated and is running on some other
CPU. These cases are already handled. However, if the task has been
migrated, has finished executing, and another CPU is now trying to
wake it up (ttwu) such that it is queued again on a runqueue
(on_rq is 1), and the task was last run by the same CPU, then the
current checks will pass even though the task was migrated out and is
no longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel
panics, with various crash signatures such as assert failures, page
faults, null pointer dereferences, and queue corruption errors, all
coming from the scheduler itself.

Some of the crashes:
-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0
   ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0
   ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0
   ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520
   ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0
   ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0
   ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40
   do_idle+0x15e/0x200
   cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160
   secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60
   ? die+0x2a/0x50
   ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0
   ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70
   dequeue_task_rt+0x2d/0x70
   __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0
   futex_wait_queue_me+0xb6/0x120
   futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60
   ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0
   do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60
   ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0
   ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30
   ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120
   ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490
   ? futex_wake+0x78/0x160
   ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40
   ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70
   ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20
   ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150
   ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed
due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) The previous task has completed execution, and thus CPU Z enters
     schedule and locks its own rq after CPU A releases it.
  b) CPU Z dequeues previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (e.g. by doing IO or waiting to acquire a
     lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates
     task p.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is
     still on CPU Z.
  c) CPU A fails to notice that task p was dequeued from CPU Z while it
     was waiting for the locks in double_lock_balance. Had CPU A noticed
     the dequeue, find_lock_lowest_rq would have returned NULL, forcing
     push_rt_task to give up on task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Y’s rq. CPU B enqueues task p onto Y and sets task
     p.on_rq = 1.
  f) CPU B unlocks CPU Y’s rq, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to CPU A dequeuing p from CPU Y's runqueue and to various
crashes down the line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a),
the check is enhanced to make sure that the task is still at the head of
the pushable tasks list. If it is not, then it is in any case not
suitable for being pushed out. The fix also removes the conditions that
are no longer needed.

Testing
=======
The fix was tested on a cluster of 3 nodes, where panics due to this
issue were hit every couple of days. A fix similar to this was deployed
on such a cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Tested-by: Will Ton <william.ton@nutanix.com>
---
 kernel/sched/rt.c | 54 +++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4762dd3f50c5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
 	return -1;
 }
 
+static struct task_struct *pick_next_pushable_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->rt.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_donor(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!task_on_rq_queued(p));
+	BUG_ON(!rt_task(p));
+
+	return p;
+}
+
 /* Will lock the rq it finds */
 static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 {
@@ -1915,18 +1936,16 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 			/*
 			 * We had to unlock the run queue. In
 			 * the mean time, task could have
-			 * migrated already or had its affinity changed.
-			 * Also make sure that it wasn't scheduled on its rq.
+			 * migrated already or had its affinity changed,
+			 * therefore check if the task is still at the
+			 * head of the pushable tasks list.
 			 * It is possible the task was scheduled, set
 			 * "migrate_disabled" and then got preempted, so we must
 			 * check the task migration disable flag here too.
 			 */
-			if (unlikely(task_rq(task) != rq ||
+			if (unlikely(is_migration_disabled(task) ||
 				     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
-				     task_on_cpu(rq, task) ||
-				     !rt_task(task) ||
-				     is_migration_disabled(task) ||
-				     !task_on_rq_queued(task))) {
+				     task != pick_next_pushable_task(rq))) {
 
 				double_unlock_balance(rq, lowest_rq);
 				lowest_rq = NULL;
@@ -1946,27 +1965,6 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	return lowest_rq;
 }
 
-static struct task_struct *pick_next_pushable_task(struct rq *rq)
-{
-	struct task_struct *p;
-
-	if (!has_pushable_tasks(rq))
-		return NULL;
-
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
-
-	BUG_ON(rq->cpu != task_cpu(p));
-	BUG_ON(task_current(rq, p));
-	BUG_ON(task_current_donor(rq, p));
-	BUG_ON(p->nr_cpus_allowed <= 1);
-
-	BUG_ON(!task_on_rq_queued(p));
-	BUG_ON(!rt_task(p));
-
-	return p;
-}
-
 /*
  * If the current CPU has more than one RT task, see if the non
  * running task can migrate over to a CPU that is running a task
-- 
2.22.3

Re: [PATCH v2] sched/rt: Fix race in push_rt_task
Posted by Steven Rostedt 10 months ago
FYI,

You should always send a new patch version as a separate thread. That's
because they can get lost in the thread, and it makes it harder for maintainers
to know what the next version of the patch is. I've picked the wrong patch
version before because there was another version sent that I missed.

On Thu, 13 Feb 2025 17:54:34 +0000
Harshit Agarwal <harshit@nutanix.com> wrote:

> Solution
> ========
> The solution here is fairly simple. After obtaining the lock (at 4a),
> the check is enhanced to make sure that the task is still at the head of
> the pushable tasks list. If not, then it is anyway not suitable for
> being pushed out. The fix also removes any conditions that are no longer
> needed.
> 
> Testing
> =======
> The fix is tested on a cluster of 3 nodes, where the panics due to this
> are hit every couple of days. A fix similar to this was deployed on such
> cluster and was stable for more than 30 days.

May also want to add:

  Since 'is_migration_disabled()' is a faster check than the others, it was moved
  to be the first check for consistency.

> 
> Co-developed-by: Jon Kohler <jon@nutanix.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
> Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
> Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
> Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
> Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
> Tested-by: Will Ton <william.ton@nutanix.com>
> ---

You can add here (after the above three dashes) how this version is
different from the last version. The text below the dashes and before the
patch is ignored by git, but is useful for reviewers. For instance:

Changes since v1: https://lore.kernel.org/all/20250211054646.23987-1-harshit@nutanix.com/

- Removed the redundant checks that task != pick_next_pushable_task() already has


Notice I added a link to the previous version. This helps find the previous
version without having to make this version a reply to it.

>  kernel/sched/rt.c | 54 +++++++++++++++++++++++------------------------
>  1 file changed, 26 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4b8e33c615b1..4762dd3f50c5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
>  	return -1;
>  }
>  

Otherwise,

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Peter, could you pick this up?

-- Steve
Re: [PATCH v2] sched/rt: Fix race in push_rt_task
Posted by Harshit Agarwal 10 months ago
Hi Steve,

Thanks for the information. I realized my mistake after updating this
thread, and later sent the patch as a separate thread here:
https://lore.kernel.org/lkml/20250214170844.201692-1-harshit@nutanix.com/
where I included the link to v1 along with the changes made.

I had also added a comment on is_migration_disabled there, as you suggested.
That separate patch already addresses your comments.
Sorry for the confusion.

Regards,
Harshit

> On Feb 17, 2025, at 7:50 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
> FYI,
> 
> You should always send a new patch version as a separate thread. That's
> because they can get lost in the thread, and it makes it harder for maintainers
> to know what the next version of the patch is. I've picked the wrong patch
> version before because there was another version sent that I missed.
> 
> On Thu, 13 Feb 2025 17:54:34 +0000
> Harshit Agarwal <harshit@nutanix.com> wrote:
> 
>> Solution
>> ========
>> The solution here is fairly simple. After obtaining the lock (at 4a),
>> the check is enhanced to make sure that the task is still at the head of
>> the pushable tasks list. If not, then it is anyway not suitable for
>> being pushed out. The fix also removes any conditions that are no longer
>> needed.
>> 
>> Testing
>> =======
>> The fix is tested on a cluster of 3 nodes, where the panics due to this
>> are hit every couple of days. A fix similar to this was deployed on such
>> cluster and was stable for more than 30 days.
> 
> May also want to add:
> 
>  Since 'is_migration_disabled()' is a faster check than the others, it was moved
>  to be the first check for consistency.
> 
>> 
>> Co-developed-by: Jon Kohler <jon@nutanix.com>
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
>> Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
>> Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
>> Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
>> Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
>> Tested-by: Will Ton <william.ton@nutanix.com>
>> ---
> 
> You can add here (after the above three dashes) how this version is
> different from the last version. The text below the dashes and before the
> patch is ignored by git, but is useful for reviewers. For instance:
> 
> Changes since v1: https://lore.kernel.org/all/20250211054646.23987-1-harshit@nutanix.com/
> 
> - Removed the redundant checks that task != pick_next_pushable_task() already has
> 
> 
> Notice I added a link to the previous version. This helps find the previous
> version without having to make this version a reply to it.
> 
>> kernel/sched/rt.c | 54 +++++++++++++++++++++++------------------------
>> 1 file changed, 26 insertions(+), 28 deletions(-)
>> 
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 4b8e33c615b1..4762dd3f50c5 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1885,6 +1885,27 @@ static int find_lowest_rq(struct task_struct *task)
>> return -1;
>> }
>> 
> 
> Otherwise,
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> Peter, could you pick this up?
> 
> -- Steve

Re: [PATCH v2] sched/rt: Fix race in push_rt_task
Posted by Steven Rostedt 10 months ago
On Mon, 17 Feb 2025 16:36:47 +0000
Harshit Agarwal <harshit@nutanix.com> wrote:

> Hi Steve,
> 
> Thanks for the information. I realized my mistake after updating this
> thread, and later sent the patch as a separate thread here:
> https://lore.kernel.org/lkml/20250214170844.201692-1-harshit@nutanix.com/
> where I included the link to v1 along with the changes made.

Yeah I just noticed it ;-)


> 
> I had also added a comment on is_migration_disabled there, as you suggested.
> That separate patch already addresses your comments.
> Sorry for the confusion.

I'm behind in my email and I'm just chugging away in order.

-- Steve