There is a hazard in the deadline scheduler where an offlined CPU
can have its free_cpus bit left set in the def_root_domain when
the schedutil cpufreq governor is used. This can allow a deadline
thread to be pushed to the runqueue of a powered-down CPU, which
breaks scheduling. The details can be found here:
https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com

The free_cpus mask is expected to be cleared by set_rq_offline();
however, the hazard occurs before the root domain is made online
during CPU hotplug, so that function is not invoked for the CPU
that is being made active.
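
Roughly, the onlining sequence that sets the trap looks like this
(a simplified sketch of my understanding; see the link above for
the full details):

  CPUn coming online, rq still attached to def_root_domain:
    "sugov:n" (SCHED_DEADLINE) briefly runs on CPUn and completes
      dec_dl_deadline()
        cpudl_clear(&def_root_domain.cpudl, n)
          -> sets the free_cpus bit for CPUn
    sched domains are then built and rq_attach_root() moves the
    rq to the dynamic root domain
      -> the rq was never online in def_root_domain, so
         set_rq_offline() is not invoked and the stale
         free_cpus bit is left behind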

This commit works around the issue by ensuring the free_cpus bit
for a CPU is always cleared when the CPU is removed from a
root_domain. This likely makes the call of cpudl_clear_freecpu()
in rq_offline_dl() fully redundant, but I have not removed it
here because I am not certain of all flows.

It seems likely that a better solution is possible from someone
more familiar with the scheduler implementation, but this approach
is minimally invasive coming from someone who is not.

Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Signed-off-by: Doug Berger <opendmb@gmail.com>
---
kernel/sched/topology.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index da33ec9e94ab..3cbc14953c36 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -499,6 +499,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
set_rq_offline(rq);
cpumask_clear_cpu(rq->cpu, old_rd->span);
+ cpudl_clear_freecpu(&old_rd->cpudl, rq->cpu);
/*
* If we don't want to free the old_rd yet then
--
2.34.1
Does anyone have any additional feedback or suggestions for a better
solution to this issue, or should I resubmit without the RFC prefix?
Thanks,
Doug
Hi Doug,
On 14/01/25 15:04, Doug Berger wrote:
> There is a hazard in the deadline scheduler where an offlined CPU
> can have its free_cpus bit left set in the def_root_domain when
> the schedutil cpufreq governor is used. This can allow a deadline
> thread to be pushed to the runqueue of a powered down CPU which
> breaks scheduling. The details can be found here:
> https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com
>
> The free_cpus mask is expected to be cleared by set_rq_offline();
> however, the hazard occurs before the root domain is made online
> during CPU hotplug so that function is not invoked for the CPU
> that is being made active.
>
> This commit works around the issue by ensuring the free_cpus bit
> for a CPU is always cleared when the CPU is removed from a
> root_domain. This likely makes the call of cpudl_clear_freecpu()
> in rq_offline_dl() fully redundant, but I have not removed it
> here because I am not certain of all flows.
Hummm, I am not sure we are covering all the cases this way.
Would you mind trying out the following maybe? It's based around your
idea on the first patch you proposed.
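
In short, it moves the free_cpus update inside cpudl_clear()
(under cp->lock), with a new 'offline' argument deciding whether
the bit is set or cleared, so the offline path becomes a single
locked operation:

	/* rq_offline_dl() before */
	cpudl_clear(&rq->rd->cpudl, rq->cpu);
	cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);

	/* rq_offline_dl() after */
	cpudl_clear(&rq->rd->cpudl, rq->cpu, true);
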
Thanks!
Juri
---
kernel/sched/cpudeadline.c | 23 ++++++-----------------
kernel/sched/cpudeadline.h | 3 +--
kernel/sched/deadline.c | 11 ++++++-----
3 files changed, 13 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..30419fa92c1e 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -170,7 +170,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
*
* Returns: (void)
*/
-void cpudl_clear(struct cpudl *cp, int cpu)
+void cpudl_clear(struct cpudl *cp, int cpu, bool offline)
{
int old_idx, new_cpu;
unsigned long flags;
@@ -181,11 +181,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
old_idx = cp->elements[cpu].idx;
if (old_idx == IDX_INVALID) {
- /*
- * Nothing to remove if old_idx was invalid.
- * This could happen if a rq_offline_dl is
- * called for a CPU without -dl tasks running.
- */
+ if (offline)
+ cpumask_clear_cpu(cpu, cp->free_cpus);
} else {
new_cpu = cp->elements[cp->size - 1].cpu;
cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
@@ -195,7 +192,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
cp->elements[cpu].idx = IDX_INVALID;
cpudl_heapify(cp, old_idx);
- cpumask_set_cpu(cpu, cp->free_cpus);
+ if (!offline)
+ cpumask_set_cpu(cpu, cp->free_cpus);
}
raw_spin_unlock_irqrestore(&cp->lock, flags);
}
@@ -243,19 +241,10 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl)
*/
void cpudl_set_freecpu(struct cpudl *cp, int cpu)
{
+ guard(raw_spinlock)(&cp->lock);
cpumask_set_cpu(cpu, cp->free_cpus);
}
-/*
- * cpudl_clear_freecpu - Clear the cpudl.free_cpus
- * @cp: the cpudl max-heap context
- * @cpu: rd attached CPU
- */
-void cpudl_clear_freecpu(struct cpudl *cp, int cpu)
-{
- cpumask_clear_cpu(cpu, cp->free_cpus);
-}
-
/*
* cpudl_init - initialize the cpudl structure
* @cp: the cpudl max-heap context
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index 0adeda93b5fb..cb2be7a205dd 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -18,9 +18,8 @@ struct cpudl {
#ifdef CONFIG_SMP
int cpudl_find(struct cpudl *cp, struct task_struct *p, struct cpumask *later_mask);
void cpudl_set(struct cpudl *cp, int cpu, u64 dl);
-void cpudl_clear(struct cpudl *cp, int cpu);
+void cpudl_clear(struct cpudl *cp, int cpu, bool offline);
int cpudl_init(struct cpudl *cp);
void cpudl_set_freecpu(struct cpudl *cp, int cpu);
-void cpudl_clear_freecpu(struct cpudl *cp, int cpu);
void cpudl_cleanup(struct cpudl *cp);
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 643d101cb96a..b52234ebb991 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1870,7 +1870,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
if (!dl_rq->dl_nr_running) {
dl_rq->earliest_dl.curr = 0;
dl_rq->earliest_dl.next = 0;
- cpudl_clear(&rq->rd->cpudl, rq->cpu);
+ cpudl_clear(&rq->rd->cpudl, rq->cpu, false);
cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
} else {
struct rb_node *leftmost = rb_first_cached(&dl_rq->root);
@@ -2921,9 +2921,11 @@ static void rq_online_dl(struct rq *rq)
if (rq->dl.overloaded)
dl_set_overload(rq);
- cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
- if (rq->dl.dl_nr_running > 0)
+ if (rq->dl.dl_nr_running > 0) {
cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr);
+ } else {
+ cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
+ }
}
/* Assumes rq->lock is held */
@@ -2932,8 +2934,7 @@ static void rq_offline_dl(struct rq *rq)
if (rq->dl.overloaded)
dl_clear_overload(rq);
- cpudl_clear(&rq->rd->cpudl, rq->cpu);
- cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
+ cpudl_clear(&rq->rd->cpudl, rq->cpu, true);
}
void __init init_sched_dl_class(void)
--
2.47.1
On 1/16/2025 2:13 AM, Juri Lelli wrote:
> Hi Doug,
>
> Hummm, I am not sure we are covering all the cases this way.
>
> Would you mind trying out the following maybe? It's based around your
> idea on the first patch you proposed.
Thanks! I'm happy to test any suggestions.
Unfortunately, this patch doesn't prevent the issue in my testing.

I believe that when the "sugov:n" thread is scheduled early while
hotplugging CPUn, it takes the dec_dl_deadline() path upon
completion, which ends up setting the free_cpus bit in the
def_root_domain before the dynamic scheduling domain is onlined
(even with your patch). This sets the trap that gets tripped later.
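
Concretely, the ordering I believe I am seeing is (a simplified
sketch; the call chain is my reading of the flow):

  CPUn onlining, rq still in def_root_domain:
    "sugov:n" runs on CPUn and then completes
      dec_dl_deadline()
        cpudl_clear(&def_root_domain.cpudl, n, false)
          -> cpumask_set_cpu(n, free_cpus)  /* trap still set */
  dynamic scheduling domain onlined afterwards
    -> def_root_domain retains the stale free_cpus bit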

I attempted to confirm that this was in fact introduced by:

Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")

However, I discovered that it doesn't appear to have been
introduced by that commit, since the set_rq_offline() call from
sched_cpu_dying() was already meaningless: the runqueue was
already offline by that point in the unplug sequence.
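
That is, even before that commit the unplug sequence had already
taken the runqueue offline earlier, roughly (my reading of the
old code, simplified):

  CPU unplug, before 120455c514f7:
    sched domains torn down
      rq_attach_root(rq, &def_root_domain)
        set_rq_offline(rq)       /* rq goes offline here */
    ...
    sched_cpu_dying()
      set_rq_offline(rq)         /* rq already offline: no effect */
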
Thanks for your feedback,
Doug