*** Bug description ***
When testing kexec-reboot on a 144-CPU machine with
isolcpus=managed_irq,domain,1-71,73-143 on the kernel command line, I
encountered the following bug:
[ 97.114759] psci: CPU142 killed (polled 0 ms)
[ 97.333236] Failed to offline CPU143 - error=-16
[ 97.333246] ------------[ cut here ]------------
[ 97.342682] kernel BUG at kernel/cpu.c:1569!
[ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]
In essence, the issue originates from the CPU hot-removal process and is
not limited to kexec. It can be reproduced by writing a SCHED_DEADLINE
program that waits indefinitely on a semaphore, spawning multiple
instances to ensure some run on CPU 72, and then offlining CPUs 1-143
one by one. When attempting this, CPU 143 failed to go offline.
bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
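For reference, a minimal sketch of such a reproducer (the reservation
parameters and the raw sched_setattr() wrapper below are illustrative,
not taken from the actual test program; instances were pinned with
taskset as needed):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <semaphore.h>
#include <sys/syscall.h>
#include <linux/types.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

/* Local copy of the sched_attr layout consumed by sched_setattr(2). */
struct dl_sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

int main(void)
{
	struct dl_sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		/* illustrative reservation: 1ms runtime every 100ms */
		.sched_runtime	= 1000000,
		.sched_deadline	= 100000000,
		.sched_period	= 100000000,
	};
	sem_t sem;

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* Wait forever so the task sits in a blocked state. */
	sem_init(&sem, 0, 0);
	while (sem_wait(&sem))
		;
	return 0;
}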
Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
But the root domain was not actually out of bandwidth; the spurious
-EBUSY is contributed by the following factors:
When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. A
blocked-state deadline task (in this case, "cppc_fie") is not migrated
to CPU0, so its task_rq() information is stale and its rq->rd points to
def_root_domain instead of the root domain shared with CPU0. As a
result, its bandwidth is accounted into the wrong root domain during
the domain rebuild.
*** Issue ***
The key point is that a root_domain is only reachable through an active
CPU's rq->rd. To avoid introducing a global data structure that tracks
all root_domains in the system, there should be a way to locate an
active CPU within the corresponding root_domain.
*** Solution ***
To locate an active CPU, the following rules of the deadline
sub-system are useful:
-1. Any CPU belongs to a unique root domain at a given time.
-2. The DL bandwidth checker ensures that the root domain has active CPUs.
Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.
This patch groups CPUs into isolated and housekeeping sets. For the
housekeeping group, it walks up the cpuset hierarchy to find active CPUs
in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
---
v4 -> v5:
Move the housekeeping part into deadline.c (thanks to Waiman's suggestion)
Use cpuset_cpus_allowed() instead of introducing a new cpuset function (thanks to Ridong's suggestion)
kernel/sched/deadline.c | 50 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 44 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 72c1f72463c75..7555b7af49486 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2879,11 +2879,43 @@ void __init init_sched_dl_class(void)
GFP_KERNEL, cpu_to_node(i));
}
+/*
+ * This function always returns a non-empty bitmap in @cpus. This is because
+ * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
+ * check will prevent CPU hotplug from deactivating all CPUs in that domain.
+ */
+static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+ const struct cpumask *hk_msk;
+
+ hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+ if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+ if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+ /*
+ * CPUs isolated by isolcpus="domain" always belong to
+ * def_root_domain.
+ */
+ cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+ return;
+ }
+ }
+
+ /*
+ * If a root domain holds a DL task, it must have active CPUs. So
+ * active CPUs can always be found by walking up the task's cpuset
+ * hierarchy up to the partition root.
+ */
+ cpuset_cpus_allowed(p, cpus);
+}
+
+/* The caller should hold cpuset_mutex */
void dl_add_task_root_domain(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
struct dl_bw *dl_b;
+ unsigned int cpu;
+ struct cpumask msk;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2891,16 +2923,22 @@ void dl_add_task_root_domain(struct task_struct *p)
return;
}
- rq = __task_rq_lock(p, &rf);
-
+ /*
+ * Get an active rq, whose rq->rd tracks the correct root
+ * domain.
+ * The caller should hold cpuset_mutex, which guarantees that
+ * the cpu remains in the cpuset until rq->rd is fetched.
+ */
+ dl_get_task_effective_cpus(p, &msk);
+ cpu = cpumask_first_and(cpu_active_mask, &msk);
+ BUG_ON(cpu >= nr_cpu_ids);
+ rq = cpu_rq(cpu);
dl_b = &rq->rd->dl_bw;
- raw_spin_lock(&dl_b->lock);
+ raw_spin_lock(&dl_b->lock);
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
raw_spin_unlock(&dl_b->lock);
-
- task_rq_unlock(rq, p, &rf);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
}
void dl_clear_root_domain(struct root_domain *rd)
--
2.49.0
Hi,
Looks like this has two issues.
On 10/11/25 09:47, Pingfan Liu wrote:
...
> +/*
> + * This function always returns a non-empty bitmap in @cpus. This is because
> + * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
> + * check will prevent CPU hotplug from deactivating all CPUs in that domain.
> + */
> +static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> +{
> + const struct cpumask *hk_msk;
> +
> + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> + /*
> + * CPUs isolated by isolcpu="domain" always belong to
> + * def_root_domain.
> + */
> + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> + return;
> + }
> + }
> +
> + /*
> + * If a root domain holds a DL task, it must have active CPUs. So
> + * active CPUs can always be found by walking up the task's cpuset
> + * hierarchy up to the partition root.
> + */
> + cpuset_cpus_allowed(p, cpus);
Grabs callback_lock, a spin_lock (sleepable on RT), under the pi_lock
raw_spin_lock.
> +}
> +
> +/* The caller should hold cpuset_mutex */
> void dl_add_task_root_domain(struct task_struct *p)
> {
> struct rq_flags rf;
> struct rq *rq;
> struct dl_bw *dl_b;
> + unsigned int cpu;
> + struct cpumask msk;
Potentially huge mask allocated on the stack.
> raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> @@ -2891,16 +2923,22 @@ void dl_add_task_root_domain(struct task_struct *p)
> return;
> }
>
> - rq = __task_rq_lock(p, &rf);
> -
> + /*
> + * Get an active rq, whose rq->rd traces the correct root
> + * domain.
> + * And the caller should hold cpuset_mutex, which gurantees
> + * the cpu remaining in the cpuset until rq->rd is fetched.
> + */
> + dl_get_task_effective_cpus(p, &msk);
> + cpu = cpumask_first_and(cpu_active_mask, &msk);
> + BUG_ON(cpu >= nr_cpu_ids);
> + rq = cpu_rq(cpu);
> dl_b = &rq->rd->dl_bw;
> - raw_spin_lock(&dl_b->lock);
>
> + raw_spin_lock(&dl_b->lock);
> __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> -
> raw_spin_unlock(&dl_b->lock);
> -
> - task_rq_unlock(rq, p, &rf);
> + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
Thanks,
Juri
Hi Juri,
Thanks for your review. Please see the comments below.
On Mon, Nov 10, 2025 at 12:14:39PM +0100, Juri Lelli wrote:
> Hi,
>
> Looks like this has two issues.
>
> On 10/11/25 09:47, Pingfan Liu wrote:
>
> ...
>
> > +/*
> > + * This function always returns a non-empty bitmap in @cpus. This is because
> > + * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
> > + * check will prevent CPU hotplug from deactivating all CPUs in that domain.
> > + */
> > +static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> > +{
> > + const struct cpumask *hk_msk;
> > +
> > + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> > + /*
> > + * CPUs isolated by isolcpu="domain" always belong to
> > + * def_root_domain.
> > + */
> > + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * If a root domain holds a DL task, it must have active CPUs. So
> > + * active CPUs can always be found by walking up the task's cpuset
> > + * hierarchy up to the partition root.
> > + */
> > + cpuset_cpus_allowed(p, cpus);
>
> Grabs callbak_lock spin_lock (sleepable on RT) under pi_lock
> raw_spin_lock.
>
Yes, it should be fixed. I'll discuss it in my reply to Waiman's email later.
> > +}
> > +
> > +/* The caller should hold cpuset_mutex */
> > void dl_add_task_root_domain(struct task_struct *p)
> > {
> > struct rq_flags rf;
> > struct rq *rq;
> > struct dl_bw *dl_b;
> > + unsigned int cpu;
> > + struct cpumask msk;
>
> Potentially huge mask allocated on the stack.
>
Since there's no way to handle memory allocation failures, could it be
done by using alloc_cpumask_var() in init_sched_dl_class() to reserve
the memory for this purpose?
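For illustration, something along these lines (a sketch only; the name
dl_add_task_mask is made up, and serialization of the scratch mask would
still rely on the caller holding cpuset_mutex):

static cpumask_var_t dl_add_task_mask;	/* scratch mask, protected by cpuset_mutex */

void __init init_sched_dl_class(void)
{
	unsigned int i;

	for_each_possible_cpu(i)
		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
					GFP_KERNEL, cpu_to_node(i));

	/* Reserve the scratch mask once at boot, so no runtime failure path. */
	BUG_ON(!zalloc_cpumask_var(&dl_add_task_mask, GFP_KERNEL));
}

dl_add_task_root_domain() would then pass dl_add_task_mask to
dl_get_task_effective_cpus() instead of a cpumask on the stack.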
Best Regards,
Pingfan
> > raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> > if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> > @@ -2891,16 +2923,22 @@ void dl_add_task_root_domain(struct task_struct *p)
> > return;
> > }
> >
> > - rq = __task_rq_lock(p, &rf);
> > -
> > + /*
> > + * Get an active rq, whose rq->rd traces the correct root
> > + * domain.
> > + * And the caller should hold cpuset_mutex, which gurantees
> > + * the cpu remaining in the cpuset until rq->rd is fetched.
> > + */
> > + dl_get_task_effective_cpus(p, &msk);
> > + cpu = cpumask_first_and(cpu_active_mask, &msk);
> > + BUG_ON(cpu >= nr_cpu_ids);
> > + rq = cpu_rq(cpu);
> > dl_b = &rq->rd->dl_bw;
> > - raw_spin_lock(&dl_b->lock);
> >
> > + raw_spin_lock(&dl_b->lock);
> > __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> > -
> > raw_spin_unlock(&dl_b->lock);
> > -
> > - task_rq_unlock(rq, p, &rf);
> > + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>
> Thanks,
> Juri
>
On 11/11/25 19:40, Pingfan Liu wrote:
> Hi Juri,
>
> Thanks for your review. Please see the comments below.

...

> Since there's no way to handle memory allocation failures, could it be
> done by using alloc_cpumask_var() in init_sched_dl_class() to reserve
> the memory for this purpose?

Maybe something similar to local_cpu_mask_dl (or use local_cpu_mask_dl
directly?), I'm thinking.
On 11/10/25 6:14 AM, Juri Lelli wrote:
> Hi,
>
> Looks like this has two issues.
>
> On 10/11/25 09:47, Pingfan Liu wrote:
>
> ...
>
>> +/*
>> + * This function always returns a non-empty bitmap in @cpus. This is because
>> + * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
>> + * check will prevent CPU hotplug from deactivating all CPUs in that domain.
>> + */
>> +static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
>> +{
>> + const struct cpumask *hk_msk;
>> +
>> + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
>> + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
>> + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
>> + /*
>> + * CPUs isolated by isolcpu="domain" always belong to
>> + * def_root_domain.
>> + */
>> + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
>> + return;
>> + }
>> + }
>> +
>> + /*
>> + * If a root domain holds a DL task, it must have active CPUs. So
>> + * active CPUs can always be found by walking up the task's cpuset
>> + * hierarchy up to the partition root.
>> + */
>> + cpuset_cpus_allowed(p, cpus);
> Grabs callbak_lock spin_lock (sleepable on RT) under pi_lock
> raw_spin_lock.
I have been thinking about changing callback_lock to a raw_spinlock_t,
but need to find a good use case for this change. So it is a solvable
problem.
>> +}
>> +
>> +/* The caller should hold cpuset_mutex */
There is an upstream patch series that will add a helper function to
check whether cpuset_mutex is held. So this comment should be replaced
by a call to that helper function once it is available in the Linux
mainline.
>> void dl_add_task_root_domain(struct task_struct *p)
>> {
>> struct rq_flags rf;
>> struct rq *rq;
>> struct dl_bw *dl_b;
>> + unsigned int cpu;
>> + struct cpumask msk;
> Potentially huge mask allocated on the stack.
Yes, we should use cpumask_var_t and call alloc_cpumask_var() before
acquiring the lock.
Cheers,
Longman
>
>> raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>> if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
>> @@ -2891,16 +2923,22 @@ void dl_add_task_root_domain(struct task_struct *p)
>> return;
>> }
>>
>> - rq = __task_rq_lock(p, &rf);
>> -
>> + /*
>> + * Get an active rq, whose rq->rd traces the correct root
>> + * domain.
>> + * And the caller should hold cpuset_mutex, which gurantees
>> + * the cpu remaining in the cpuset until rq->rd is fetched.
>> + */
>> + dl_get_task_effective_cpus(p, &msk);
>> + cpu = cpumask_first_and(cpu_active_mask, &msk);
>> + BUG_ON(cpu >= nr_cpu_ids);
>> + rq = cpu_rq(cpu);
>> dl_b = &rq->rd->dl_bw;
>> - raw_spin_lock(&dl_b->lock);
>>
>> + raw_spin_lock(&dl_b->lock);
>> __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
>> -
>> raw_spin_unlock(&dl_b->lock);
>> -
>> - task_rq_unlock(rq, p, &rf);
>> + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> Thanks,
> Juri
>
On 11/10/25 4:07 PM, Waiman Long wrote:
> On 11/10/25 6:14 AM, Juri Lelli wrote:
>> Hi,
>>
>> Looks like this has two issues.
>>
>> On 10/11/25 09:47, Pingfan Liu wrote:
>>
>> ...
>>
>>> +/*
>>> + * This function always returns a non-empty bitmap in @cpus. This
>>> is because
>>> + * if a root domain has reserved bandwidth for DL tasks, the DL
>>> bandwidth
>>> + * check will prevent CPU hotplug from deactivating all CPUs in
>>> that domain.
>>> + */
>>> +static void dl_get_task_effective_cpus(struct task_struct *p,
>>> struct cpumask *cpus)
>>> +{
>>> + const struct cpumask *hk_msk;
>>> +
>>> + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
>>> + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
>>> + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
>>> + /*
>>> + * CPUs isolated by isolcpu="domain" always belong to
>>> + * def_root_domain.
>>> + */
>>> + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
>>> + return;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * If a root domain holds a DL task, it must have active CPUs. So
>>> + * active CPUs can always be found by walking up the task's cpuset
>>> + * hierarchy up to the partition root.
>>> + */
>>> + cpuset_cpus_allowed(p, cpus);
>> Grabs callbak_lock spin_lock (sleepable on RT) under pi_lock
>> raw_spin_lock.
> I have been thinking about changing callback_lock to a raw_spinlock_t,
> but need to find a good use case for this change. So it is a solvable
> problem.
Actually, we don't need to acquire the callback_lock if cpuset_mutex is
held. So another possibility is to create a cpuset_cpus_allowed()
variant that doesn't acquire the callback_lock but asserts that
cpuset_mutex is held.
Cheers,
Longman
On Mon, Nov 10, 2025 at 05:08:56PM -0500, Waiman Long wrote:
> On 11/10/25 4:07 PM, Waiman Long wrote:
> > On 11/10/25 6:14 AM, Juri Lelli wrote:
> > > Hi,
> > >
> > > Looks like this has two issues.
> > >
> > > On 10/11/25 09:47, Pingfan Liu wrote:
> > >
> > > ...
> > >
> > > > +/*
> > > > + * This function always returns a non-empty bitmap in @cpus.
> > > > This is because
> > > > + * if a root domain has reserved bandwidth for DL tasks, the DL
> > > > bandwidth
> > > > + * check will prevent CPU hotplug from deactivating all CPUs in
> > > > that domain.
> > > > + */
> > > > +static void dl_get_task_effective_cpus(struct task_struct *p,
> > > > struct cpumask *cpus)
> > > > +{
> > > > + const struct cpumask *hk_msk;
> > > > +
> > > > + hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > > > + if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > > > + if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> > > > + /*
> > > > + * CPUs isolated by isolcpu="domain" always belong to
> > > > + * def_root_domain.
> > > > + */
> > > > + cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > > > + return;
> > > > + }
> > > > + }
> > > > +
> > > > + /*
> > > > + * If a root domain holds a DL task, it must have active CPUs. So
> > > > + * active CPUs can always be found by walking up the task's cpuset
> > > > + * hierarchy up to the partition root.
> > > > + */
> > > > + cpuset_cpus_allowed(p, cpus);
> > > Grabs callbak_lock spin_lock (sleepable on RT) under pi_lock
> > > raw_spin_lock.
> > I have been thinking about changing callback_lock to a raw_spinlock_t,
> > but need to find a good use case for this change. So it is a solvable
> > problem.
>
Thank you very much for your accommodation.
> Actually, we don't need to acquire the callback_lock if cpuset_mutex is
> held. So another possibility is to create a cpuset_cpus_allowed() variant
> that doesn't acquire the callback_mutex but assert that cpuset_mutex is
> held.
>
The real requirement is a read-side protection section spanning from
dl_get_task_effective_cpus() to the fetch of dl_b = &rq->rd->dl_bw.
Since there is no handy lock that can span across
cpuset_cpus_allowed(), I chose the writer-side lock "cpuset_mutex".
It would be ideal if cpuset_cpus_allowed() had a
cpuset_cpus_allowed_nolock() variant, and if callback_lock could be
changed to a raw_spinlock_t.
But if those changes are too much to ask for, I could move
dl_get_task_effective_cpus() outside the pi_lock and re-check
task_cs(task) as an alternative.
Best Regards,
Pingfan