[PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug

Pingfan Liu posted 2 patches 1 week, 5 days ago
[PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pingfan Liu 1 week, 5 days ago
*** Bug description ***
When testing kexec-reboot on a 144 cpus machine with
isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
encounter the following bug:

[   97.114759] psci: CPU142 killed (polled 0 ms)
[   97.333236] Failed to offline CPU143 - error=-16
[   97.333246] ------------[ cut here ]------------
[   97.342682] kernel BUG at kernel/cpu.c:1569!
[   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]

In essence, the issue originates from the CPU hot-removal process and is
not limited to kexec. It can be reproduced by writing a SCHED_DEADLINE
program that waits indefinitely on a semaphore, spawning multiple
instances to ensure some run on CPU 72, and then offlining CPUs 1-143
one by one. When attempting this, CPU 143 failed to go offline:
  bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'

Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
But the root domain was not actually over-committed; the spurious
-EBUSY is contributed by the following factors:
When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. A
blocked-state deadline task (in this case, "cppc_fie") was not migrated
to CPU0, so its task_rq() information is stale and its rq->rd points to
def_root_domain instead of the root domain shared with CPU0. As a
result, its bandwidth is accounted into the wrong root domain during
the domain rebuild.

*** Issue ***
The key point is that a root_domain can only be reached through an
active CPU's rq->rd. To avoid introducing a global data structure that
tracks every root_domain in the system, there must be a way to locate
an active CPU within the corresponding root domain.

*** Solution ***
To locate an active CPU, the following rules of the deadline
subsystem are useful:
  1. Any CPU belongs to exactly one root domain at a given time.
  2. The DL bandwidth checker ensures that a root domain holding
     reserved DL bandwidth keeps at least one active CPU.

Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.

This patch groups CPUs into isolated and housekeeping sets. For the
housekeeping group, it walks up the cpuset hierarchy to find active CPUs
in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
---
 kernel/sched/deadline.c | 54 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7b7671060bf9e..194a341e85864 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2465,6 +2465,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
 	return NULL;
 }
 
+/* Access rule: must be called on local CPU with preemption disabled */
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -2907,11 +2908,43 @@ void __init init_sched_dl_class(void)
 					GFP_KERNEL, cpu_to_node(i));
 }
 
+/*
+ * This function always returns a non-empty bitmap in @cpus. This is because
+ * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
+ * check will prevent CPU hotplug from deactivating all CPUs in that domain.
+ */
+static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+	const struct cpumask *hk_msk;
+
+	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+			/*
+			 * CPUs isolated by isolcpu="domain" always belong to
+			 * def_root_domain.
+			 */
+			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+			return;
+		}
+	}
+
+	/*
+	 * If a root domain holds a DL task, it must have active CPUs. So
+	 * active CPUs can always be found by walking up the task's cpuset
+	 * hierarchy up to the partition root.
+	 */
+	cpuset_cpus_allowed_locked(p, cpus);
+}
+
+/* The caller should hold cpuset_mutex */
 void dl_add_task_root_domain(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
+	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2919,16 +2952,25 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
-	rq = __task_rq_lock(p, &rf);
-
+	/*
+	 * Get an active rq, whose rq->rd traces the correct root
+	 * domain.
+	 * Ideally this would be under cpuset reader lock until rq->rd is
+	 * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
+	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
+	 * to guarantee the CPU stays in the cpuset.
+	 */
+	dl_get_task_effective_cpus(p, msk);
+	cpu = cpumask_first_and(cpu_active_mask, msk);
+	BUG_ON(cpu >= nr_cpu_ids);
+	rq = cpu_rq(cpu);
 	dl_b = &rq->rd->dl_bw;
-	raw_spin_lock(&dl_b->lock);
+	/* End of fetching rd */
 
+	raw_spin_lock(&dl_b->lock);
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
 void dl_clear_root_domain(struct root_domain *rd)
-- 
2.49.0

[PATCHv2 0/2] sched/deadline: Fix potential race in dl_add_task_root_domain()
Posted by Pingfan Liu 6 days, 22 hours ago
These two patches address the issue reported by Juri [1] (thanks!).

The first removes an unnecessary comment, the second is the actual fix.

@Tejun, while these could be squashed together, I kept them separate to
maintain the one-patch-one-purpose rule. Let me know if you'd like me to
resend these in a different format, or feel free to adjust as needed.

[1]: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb

Pingfan Liu (2):
  sched/deadline: Remove unnecessary comment in
    dl_add_task_root_domain()
  sched/deadline: Fix potential race in dl_add_task_root_domain()

 kernel/sched/deadline.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

-- 
2.49.0
Re: [PATCHv2 0/2] sched/deadline: Fix potential race in dl_add_task_root_domain()
Posted by Waiman Long 6 days, 11 hours ago
On 11/24/25 10:26 PM, Pingfan Liu wrote:
> These two patches address the issue reported by Juri [1] (thanks!).
>
> The first removes an unnecessary comment, the second is the actual fix.
>
> @Tejun, while these could be squashed together, I kept them separate to
> maintain the one-patch-one-purpose rule. let me know if you'd like me to
> resend these in a different format, or feel free to adjust as needed.
>
> [1]: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb
>
> Pingfan Liu (2):
>    sched/deadline: Remove unnecessary comment in
>      dl_add_task_root_domain()
>    sched/deadline: Fix potential race in dl_add_task_root_domain()
>
>   kernel/sched/deadline.c | 12 ++----------
>   1 file changed, 2 insertions(+), 10 deletions(-)
>
For the patch series,

Acked-by:  Waiman Long <longman@redhat.com>

Re: [PATCHv2 0/2] sched/deadline: Fix potential race in dl_add_task_root_domain()
Posted by Juri Lelli 6 days, 18 hours ago
Hi,

On 25/11/25 11:26, Pingfan Liu wrote:
> These two patches address the issue reported by Juri [1] (thanks!).
> 
> The first removes an unnecessary comment, the second is the actual fix.
> 
> @Tejun, while these could be squashed together, I kept them separate to
> maintain the one-patch-one-purpose rule. let me know if you'd like me to
> resend these in a different format, or feel free to adjust as needed.
> 
> [1]: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb
> 
> Pingfan Liu (2):
>   sched/deadline: Remove unnecessary comment in
>     dl_add_task_root_domain()
>   sched/deadline: Fix potential race in dl_add_task_root_domain()
> 
>  kernel/sched/deadline.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)

For both

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks!
Juri
[PATCHv2 1/2] sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()
Posted by Pingfan Liu 6 days, 22 hours ago
The comments above dl_get_task_effective_cpus() and
dl_add_task_root_domain() already explain how to fetch a valid
root domain and protect against races. There's no need to repeat
this inside dl_add_task_root_domain(). Remove the redundant comment
to keep the code clean.

No functional change.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: cgroups@vger.kernel.org
To: linux-kernel@vger.kernel.org
---
 kernel/sched/deadline.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 194a341e85864..0b6646259bcd7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2952,20 +2952,11 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
-	/*
-	 * Get an active rq, whose rq->rd traces the correct root
-	 * domain.
-	 * Ideally this would be under cpuset reader lock until rq->rd is
-	 * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
-	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
-	 * to guarantee the CPU stays in the cpuset.
-	 */
 	dl_get_task_effective_cpus(p, msk);
 	cpu = cpumask_first_and(cpu_active_mask, msk);
 	BUG_ON(cpu >= nr_cpu_ids);
 	rq = cpu_rq(cpu);
 	dl_b = &rq->rd->dl_bw;
-	/* End of fetching rd */
 
 	raw_spin_lock(&dl_b->lock);
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-- 
2.49.0
[PATCHv2 2/2] sched/deadline: Fix potential race in dl_add_task_root_domain()
Posted by Pingfan Liu 6 days, 22 hours ago
The access rule for local_cpu_mask_dl requires that it only be
accessed on the local CPU with preemption disabled. However,
dl_add_task_root_domain() currently violates this rule.

Without preemption disabled, the following race can occur:

1. ThreadA calls dl_add_task_root_domain() on CPU 0
2. Gets pointer to CPU 0's local_cpu_mask_dl
3. ThreadA is preempted and migrated to CPU 1
4. ThreadA continues using CPU 0's local_cpu_mask_dl
5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
   uses local_cpu_mask_dl (with preemption properly disabled)
6. Both contexts now corrupt the same per-CPU buffer concurrently

Fix this by moving the local_cpu_mask_dl access into the
preemption-disabled section, after pi_lock has been taken with
interrupts disabled.

Closes: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb
Fixes: 318e18ed22e8 ("sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug")
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: cgroups@vger.kernel.org
To: linux-kernel@vger.kernel.org
---
 kernel/sched/deadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0b6646259bcd7..65c8539b468a2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2944,7 +2944,7 @@ void dl_add_task_root_domain(struct task_struct *p)
 	struct rq *rq;
 	struct dl_bw *dl_b;
 	unsigned int cpu;
-	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
+	struct cpumask *msk;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2952,6 +2952,7 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
+	msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
 	dl_get_task_effective_cpus(p, msk);
 	cpu = cpumask_first_and(cpu_active_mask, msk);
 	BUG_ON(cpu >= nr_cpu_ids);
-- 
2.49.0
Re: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Juri Lelli 1 week, 3 days ago
Hi!

On 19/11/25 17:55, Pingfan Liu wrote:

...

> +/* Access rule: must be called on local CPU with preemption disabled */
>  static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);

...

> +/* The caller should hold cpuset_mutex */

Maybe we can add a lockdep explicit check?

>  void dl_add_task_root_domain(struct task_struct *p)
>  {
>  	struct rq_flags rf;
>  	struct rq *rq;
>  	struct dl_bw *dl_b;
> +	unsigned int cpu;
> +	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);

Can this corrupt local_cpu_mask_dl?

Without preemption being disabled, the following race can occur:

1. Thread calls dl_add_task_root_domain() on CPU 0
2. Gets pointer to CPU 0's local_cpu_mask_dl
3. Thread is preempted and migrated to CPU 1
4. Thread continues using CPU 0's local_cpu_mask_dl
5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
   uses local_cpu_mask_dl (with preemption properly disabled)
6. Both contexts now corrupt the same per-CPU buffer concurrently

>  
>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);

It's safe to get the pointer after this point.

>  	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> @@ -2919,16 +2952,25 @@ void dl_add_task_root_domain(struct task_struct *p)
>  		return;
>  	}
>  
> -	rq = __task_rq_lock(p, &rf);
> -
> +	/*
> +	 * Get an active rq, whose rq->rd traces the correct root
> +	 * domain.
> +	 * Ideally this would be under cpuset reader lock until rq->rd is
> +	 * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
> +	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
> +	 * to guarantee the CPU stays in the cpuset.
> +	 */
> +	dl_get_task_effective_cpus(p, msk);
> +	cpu = cpumask_first_and(cpu_active_mask, msk);
> +	BUG_ON(cpu >= nr_cpu_ids);
> +	rq = cpu_rq(cpu);
>  	dl_b = &rq->rd->dl_bw;
> -	raw_spin_lock(&dl_b->lock);
> +	/* End of fetching rd */

Not sure we need this comment above. :)

> +	raw_spin_lock(&dl_b->lock);
>  	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
> -
>  	raw_spin_unlock(&dl_b->lock);
> -
> -	task_rq_unlock(rq, p, &rf);
> +	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>  }

Thanks,
Juri
Re: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pingfan Liu 1 week, 1 day ago
On Fri, Nov 21, 2025 at 02:05:31PM +0100, Juri Lelli wrote:
> Hi!
> 
> On 19/11/25 17:55, Pingfan Liu wrote:
> 
> ...
> 
> > +/* Access rule: must be called on local CPU with preemption disabled */
> >  static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> 
> ...
> 
> > +/* The caller should hold cpuset_mutex */
> 
> Maybe we can add a lockdep explicit check?
> 
Currently, all cpuset locks are encapsulated in 
kernel/cgroup/cpuset-internal.h. I'm not sure if it's appropriate to
expose them. If exposing them is acceptable,
cpuset_callback_lock_irq()/cpuset_callback_unlock_irq() would be
preferable to cpuset_mutex assertion.

@Waiman, @Ridong, could you kindly share your opinion?

> >  void dl_add_task_root_domain(struct task_struct *p)
> >  {
> >  	struct rq_flags rf;
> >  	struct rq *rq;
> >  	struct dl_bw *dl_b;
> > +	unsigned int cpu;
> > +	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
> 
> Can this corrupt local_cpu_mask_dl?
> 
> Without preemption being disabled, the following race can occur:
> 
> 1. Thread calls dl_add_task_root_domain() on CPU 0
> 2. Gets pointer to CPU 0's local_cpu_mask_dl
> 3. Thread is preempted and migrated to CPU 1
> 4. Thread continues using CPU 0's local_cpu_mask_dl
> 5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
>    uses local_cpu_mask_dl (with preemption properly disabled)
> 6. Both contexts now corrupt the same per-CPU buffer concurrently
> 

Oh, that is definitely an issue. Thanks for pointing it out.

> >  
> >  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> 
> It's safe to get the pointer after this point.
> 

Yes.
> >  	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> > @@ -2919,16 +2952,25 @@ void dl_add_task_root_domain(struct task_struct *p)
> >  		return;
> >  	}
> >  
> > -	rq = __task_rq_lock(p, &rf);
> > -
> > +	/*
> > +	 * Get an active rq, whose rq->rd traces the correct root
> > +	 * domain.
> > +	 * Ideally this would be under cpuset reader lock until rq->rd is
> > +	 * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
> > +	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
> > +	 * to guarantee the CPU stays in the cpuset.
> > +	 */
> > +	dl_get_task_effective_cpus(p, msk);
> > +	cpu = cpumask_first_and(cpu_active_mask, msk);
> > +	BUG_ON(cpu >= nr_cpu_ids);
> > +	rq = cpu_rq(cpu);
> >  	dl_b = &rq->rd->dl_bw;
> > -	raw_spin_lock(&dl_b->lock);
> > +	/* End of fetching rd */
> 
> Not sure we need this comment above. :)
> 

OK, I can remove them to keep the code neat.


Thanks,

Pingfan
Re: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Waiman Long 1 week ago
On 11/23/25 8:45 PM, Pingfan Liu wrote:
> On Fri, Nov 21, 2025 at 02:05:31PM +0100, Juri Lelli wrote:
>> Hi!
>>
>> On 19/11/25 17:55, Pingfan Liu wrote:
>>
>> ...
>>
>>> +/* Access rule: must be called on local CPU with preemption disabled */
>>>   static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>> ...
>>
>>> +/* The caller should hold cpuset_mutex */
>> Maybe we can add a lockdep explicit check?
>>
> Currently, all cpuset locks are encapsulated in
> kernel/cgroup/cpuset-internal.h. I'm not sure if it's appropriate to
> expose them. If exposing them is acceptable,
> cpuset_callback_lock_irq()/cpuset_callback_unlock_irq() would be
> preferable to cpuset_mutex assertion.
>
> @Waiman, @Ridong, could you kindly share your opinion?

The cpuset_cpus_allowed_locked() already has a 
"lockdep_assert_held(&cpuset_mutex)"  call to make sure that 
cpuset_mutex is held, or a warning will be printed by the debug kernel. 
So a check is there, it is just not in the deadline.c code. The 
dl_add_task_root_domain() is called indirectly from 
dl_rebuild_rd_accounting() in cpuset.c which does have an assertion on 
cpuset_mutex.

There is an external visible cpuset_lock/unlock() to acquire and release 
the cpuset_mutex. However, there is no public API to assert that 
cpuset_mutex is held. There is another set of patch series that is going 
to add that in the near future. At this point, I don't think we need to 
have such an API yet. I will suggest adding comment 
to cpuset_cpus_allowed_locked() that it will warn if cpuset_mutex isn't 
held.

Providing a cpuset_callback_{lock|unlock}_irq() helpers may not be 
helpful because we are back to the problem that callback_lock isn't a 
raw_spinlock_t.

Cheers,
Longman
>
>>>   void dl_add_task_root_domain(struct task_struct *p)
>>>   {
>>>   	struct rq_flags rf;
>>>   	struct rq *rq;
>>>   	struct dl_bw *dl_b;
>>> +	unsigned int cpu;
>>> +	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);
>> Can this corrupt local_cpu_mask_dl?
>>
>> Without preemption being disabled, the following race can occur:
>>
>> 1. Thread calls dl_add_task_root_domain() on CPU 0
>> 2. Gets pointer to CPU 0's local_cpu_mask_dl
>> 3. Thread is preempted and migrated to CPU 1
>> 4. Thread continues using CPU 0's local_cpu_mask_dl
>> 5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
>>     uses local_cpu_mask_dl (with preemption properly disabled)
>> 6. Both contexts now corrupt the same per-CPU buffer concurrently
>>
> Oh, that is definitely an issue. Thanks for pointing it out.
>
>>>   
>>>   	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>> It's safe to get the pointer after this point.
>>
> Yes.
>>>   	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
>>> @@ -2919,16 +2952,25 @@ void dl_add_task_root_domain(struct task_struct *p)
>>>   		return;
>>>   	}
>>>   
>>> -	rq = __task_rq_lock(p, &rf);
>>> -
>>> +	/*
>>> +	 * Get an active rq, whose rq->rd traces the correct root
>>> +	 * domain.
>>> +	 * Ideally this would be under cpuset reader lock until rq->rd is
>>> +	 * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
>>> +	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
>>> +	 * to guarantee the CPU stays in the cpuset.
>>> +	 */
>>> +	dl_get_task_effective_cpus(p, msk);
>>> +	cpu = cpumask_first_and(cpu_active_mask, msk);
>>> +	BUG_ON(cpu >= nr_cpu_ids);
>>> +	rq = cpu_rq(cpu);
>>>   	dl_b = &rq->rd->dl_bw;
>>> -	raw_spin_lock(&dl_b->lock);
>>> +	/* End of fetching rd */
>> Not sure we need this comment above. :)
>>
> OK, I can remove them to keep the code neat.
>
>
> Thanks,
>
> Pingfan
>

Re: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Pingfan Liu 1 week ago
On Mon, Nov 24, 2025 at 10:24 AM Waiman Long <llong@redhat.com> wrote:
>
[...]
> > Currently, all cpuset locks are encapsulated in
> > kernel/cgroup/cpuset-internal.h. I'm not sure if it's appropriate to
> > expose them. If exposing them is acceptable,
> > cpuset_callback_lock_irq()/cpuset_callback_unlock_irq() would be
> > preferable to cpuset_mutex assertion.
> >
> > @Waiman, @Ridong, could you kindly share your opinion?
>
> The cpuset_cpus_allowed_locked() already has a
> "lockdep_assert_held(&cpuset_mutex)"  call to make sure that
> cpuset_mutex is held, or a warning will be printed by the debug kernel.
> So a check is there, it is just not in the deadline.c code. The
> dl_add_task_root_domain() is called indirectly from
> dl_rebuild_rd_accounting() in cpuset.c which does have an assertion on
> cpuset_mutex.
>
> There is an external visible cpuset_lock/unlock() to acquire and release
> the cpuset_mutex. However, there is no public API to assert that
> cpuset_mutex is held. There is another set of patch series that is going
> to add that in the near future. At this point, I don't think we need to
> have such an API yet. I will suggest adding comment
> to cpuset_cpus_allowed_locked() that it will warn if cpuset_mutex isn't
> held.
>
> Providing a cpuset_callback_{lock|unlock}_irq() helpers may not be
> helpful because we are back to the problem that callback_lock isn't a
> raw_spinlock_t.
>

I meant to put them outside the pi_lock, so it can reflect the
original purpose of this section -- cpuset read access instead of
write. But yes, I agree that at this point, there is no need to
introduce a public API.

> >
> >>>   void dl_add_task_root_domain(struct task_struct *p)
> >>>   {
[...]

> >>> +   /*
> >>> +    * Get an active rq, whose rq->rd traces the correct root
> >>> +    * domain.
> >>> +    * Ideally this would be under cpuset reader lock until rq->rd is
> >>> +    * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
> >>> +    * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
> >>> +    * to guarantee the CPU stays in the cpuset.
> >>> +    */
> >>> +   dl_get_task_effective_cpus(p, msk);
> >>> +   cpu = cpumask_first_and(cpu_active_mask, msk);
> >>> +   BUG_ON(cpu >= nr_cpu_ids);
> >>> +   rq = cpu_rq(cpu);
> >>>     dl_b = &rq->rd->dl_bw;
> >>> -   raw_spin_lock(&dl_b->lock);
> >>> +   /* End of fetching rd */
> >> Not sure we need this comment above. :)
> >>
> > OK, I can remove them to keep the code neat.

@Juri, sorry - I need to send out a fix that should be simple and
focused: just the fix itself, without removing the comments. So I have
not removed them. Anyway, they can remind us this is an atomic cpuset
read context.


Best Regards,

Pingfan
Re: [PATCHv7 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Posted by Waiman Long 1 week ago
On 11/23/25 10:56 PM, Pingfan Liu wrote:
> On Mon, Nov 24, 2025 at 10:24 AM Waiman Long <llong@redhat.com> wrote:
> [...]
>>> Currently, all cpuset locks are encapsulated in
>>> kernel/cgroup/cpuset-internal.h. I'm not sure if it's appropriate to
>>> expose them. If exposing them is acceptable,
>>> cpuset_callback_lock_irq()/cpuset_callback_unlock_irq() would be
>>> preferable to cpuset_mutex assertion.
>>>
>>> @Waiman, @Ridong, could you kindly share your opinion?
>> The cpuset_cpus_allowed_locked() already has a
>> "lockdep_assert_held(&cpuset_mutex)"  call to make sure that
>> cpuset_mutex is held, or a warning will be printed by the debug kernel.
>> So a check is there, it is just not in the deadline.c code. The
>> dl_add_task_root_domain() is called indirectly from
>> dl_rebuild_rd_accounting() in cpuset.c which does have an assertion on
>> cpuset_mutex.
>>
>> There is an external visible cpuset_lock/unlock() to acquire and release
>> the cpuset_mutex. However, there is no public API to assert that
>> cpuset_mutex is held. There is another set of patch series that is going
>> to add that in the near future. At this point, I don't think we need to
>> have such an API yet. I will suggest adding comment
>> to cpuset_cpus_allowed_locked() that it will warn if cpuset_mutex isn't
>> held.
>>
>> Providing a cpuset_callback_{lock|unlock}_irq() helpers may not be
>> helpful because we are back to the problem that callback_lock isn't a
>> raw_spinlock_t.
>>
> I meant to put them outside the pi_lock, so it can reflect the
> original purpose of this section -- cpuset read access instead of
> write. But yes, I agree that at this point, there is no need to
> introduce a public API.

By putting it outside of the pi_lock, you may as well just call 
cpuset_cpus_allowed().

Cheers,
Longman

>>>>>    void dl_add_task_root_domain(struct task_struct *p)
>>>>>    {
> [...]
>
>>>>> +   /*
>>>>> +    * Get an active rq, whose rq->rd traces the correct root
>>>>> +    * domain.
>>>>> +    * Ideally this would be under cpuset reader lock until rq->rd is
>>>>> +    * fetched.  However, sleepable locks cannot nest inside pi_lock, so we
>>>>> +    * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
>>>>> +    * to guarantee the CPU stays in the cpuset.
>>>>> +    */
>>>>> +   dl_get_task_effective_cpus(p, msk);
>>>>> +   cpu = cpumask_first_and(cpu_active_mask, msk);
>>>>> +   BUG_ON(cpu >= nr_cpu_ids);
>>>>> +   rq = cpu_rq(cpu);
>>>>>      dl_b = &rq->rd->dl_bw;
>>>>> -   raw_spin_lock(&dl_b->lock);
>>>>> +   /* End of fetching rd */
>>>> Not sure we need this comment above. :)
>>>>
>>> OK, I can remove them to keep the code neat.
> @Juri, sorry - I need to send out a fix that should be simple and
> focused: just the fix itself, without removing the comments. So I have
> not removed them. Anyway, they can remind us this is an atomic cpuset
> read context.
>
>
> Best Regards,
>
> Pingfan
>