[v3] Fix SCHED_DEADLINE bandwidth accounting during suspend

[PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Juri Lelli 11 months ago

Rebuilding of root domain accounting information (total_bw) is
currently broken in some cases, e.g. suspend/resume on aarch64. The
problem is that the way we keep track of domain changes and try to add
bandwidth back is convoluted and fragile.

Fix it by simplifying things: make sure bandwidth accounting is cleared
and completely restored after root domain changes (once root domains
are stable again).

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
v2 -> v3: remove spurious dl_bw_visited declaration (Shrikanth)
---
 include/linux/sched/deadline.h |  1 +
 include/linux/sched/topology.h |  2 ++
 kernel/cgroup/cpuset.c         | 16 +++++++++-------
 kernel/sched/deadline.c        | 16 ++++++++++------
 kernel/sched/topology.c        |  1 +
 5 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 6ec578600b24..f9aabbc9d22e 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -34,6 +34,7 @@ static inline bool dl_time_before(u64 a, u64 b)
 struct root_domain;
 extern void dl_add_task_root_domain(struct task_struct *p);
 extern void dl_clear_root_domain(struct root_domain *rd);
+extern void dl_clear_root_domain_cpu(int cpu);
 
 #endif /* CONFIG_SMP */
 
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..1622232bd08b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -166,6 +166,8 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+extern void dl_rebuild_rd_accounting(void);
+
 extern void partition_sched_domains_locked(int ndoms_new,
 					   cpumask_var_t doms_new[],
 					   struct sched_domain_attr *dattr_new);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f87526edb2a4..f66b2aefdc04 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -954,10 +954,12 @@ static void dl_update_tasks_root_domain(struct cpuset *cs)
 	css_task_iter_end(&it);
 }
 
-static void dl_rebuild_rd_accounting(void)
+void dl_rebuild_rd_accounting(void)
 {
 	struct cpuset *cs = NULL;
 	struct cgroup_subsys_state *pos_css;
+	int cpu;
+	u64 cookie = ++dl_cookie;
 
 	lockdep_assert_held(&cpuset_mutex);
 	lockdep_assert_cpus_held();
@@ -965,11 +967,12 @@ static void dl_rebuild_rd_accounting(void)
 
 	rcu_read_lock();
 
-	/*
-	 * Clear default root domain DL accounting, it will be computed again
-	 * if a task belongs to it.
-	 */
-	dl_clear_root_domain(&def_root_domain);
+	for_each_possible_cpu(cpu) {
+		if (dl_bw_visited(cpu, cookie))
+			continue;
+
+		dl_clear_root_domain_cpu(cpu);
+	}
 
 	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
 
@@ -996,7 +999,6 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 {
 	sched_domains_mutex_lock();
 	partition_sched_domains_locked(ndoms_new, doms_new, dattr_new);
-	dl_rebuild_rd_accounting();
 	sched_domains_mutex_unlock();
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3e05032e9e0e..5dca336cdd7c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -166,7 +166,7 @@ static inline unsigned long dl_bw_capacity(int i)
 	}
 }
 
-static inline bool dl_bw_visited(int cpu, u64 cookie)
+bool dl_bw_visited(int cpu, u64 cookie)
 {
 	struct root_domain *rd = cpu_rq(cpu)->rd;
 
@@ -207,7 +207,7 @@ static inline unsigned long dl_bw_capacity(int i)
 	return SCHED_CAPACITY_SCALE;
 }
 
-static inline bool dl_bw_visited(int cpu, u64 cookie)
+bool dl_bw_visited(int cpu, u64 cookie)
 {
 	return false;
 }
@@ -2981,18 +2981,22 @@ void dl_clear_root_domain(struct root_domain *rd)
 	rd->dl_bw.total_bw = 0;
 
 	/*
-	 * dl_server bandwidth is only restored when CPUs are attached to root
-	 * domains (after domains are created or CPUs moved back to the
-	 * default root doamin).
+	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
+	 * them, we need to account for them here explicitly.
 	 */
 	for_each_cpu(i, rd->span) {
 		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
 
 		if (dl_server(dl_se) && cpu_active(i))
-			rd->dl_bw.total_bw += dl_se->dl_bw;
+			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
 	}
 }
 
+void dl_clear_root_domain_cpu(int cpu)
+{
+	dl_clear_root_domain(cpu_rq(cpu)->rd);
+}
+
 #endif /* CONFIG_SMP */
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 44093339761c..363ad268a25b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	ndoms_cur = ndoms_new;
 
 	update_sched_domain_debugfs();
+	dl_rebuild_rd_accounting();
 }
 
 /*
-- 
2.48.1
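
To make the clear-then-rebuild idea in the patch above easier to follow, here is a minimal userspace sketch. It is not kernel code: every toy_* name is hypothetical, the per-CPU root domain pointer is a plain array, and the body of toy_bw_visited() is an assumption about how a cookie-based "already visited" check could work (the patch only shows the first line of dl_bw_visited()). Each rebuild pass bumps a cookie, clears every root domain exactly once even when several CPUs share it, and then adds the bandwidth of every task back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

/* Toy stand-in for struct root_domain: only what the sketch needs. */
struct toy_rd {
	uint64_t total_bw;	/* sum of admitted bandwidth */
	uint64_t visit_cookie;	/* last rebuild pass that touched this domain */
	unsigned int nr_clears;	/* instrumentation: how often it was cleared */
};

/* Toy stand-in for a DEADLINE task: the CPU it runs on and its bandwidth. */
struct toy_task {
	int cpu;
	uint64_t dl_bw;
};

uint64_t toy_cookie;			/* bumped once per rebuild pass */
struct toy_rd *toy_cpu_rd[TOY_NR_CPUS];	/* hypothetical cpu -> root domain map */

/* Return true if this CPU's root domain was already seen in this pass. */
bool toy_bw_visited(int cpu, uint64_t cookie)
{
	struct toy_rd *rd = toy_cpu_rd[cpu];

	if (rd->visit_cookie == cookie)
		return true;

	rd->visit_cookie = cookie;
	return false;
}

/* Clear every root domain once, then add all task bandwidth back. */
void toy_rebuild_rd_accounting(const struct toy_task *tasks, size_t nr_tasks)
{
	uint64_t cookie = ++toy_cookie;
	size_t i;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		assert(toy_cpu_rd[cpu] != NULL);
		if (toy_bw_visited(cpu, cookie))
			continue;

		toy_cpu_rd[cpu]->total_bw = 0;
		toy_cpu_rd[cpu]->nr_clears++;
	}

	for (i = 0; i < nr_tasks; i++) {
		assert(tasks[i].cpu >= 0 && tasks[i].cpu < TOY_NR_CPUS);
		toy_cpu_rd[tasks[i].cpu]->total_bw += tasks[i].dl_bw;
	}
}
```

With CPUs 0-1 pointing at one toy_rd and CPUs 2-3 at another, a stale total_bw left in either domain is discarded by the next call, and calling the function repeatedly with the same tasks always yields the same totals. That idempotence is the property the patch is after.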

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Dietmar Eggemann 11 months ago

On 10/03/2025 10:37, Juri Lelli wrote:
> Rebuilding of root domains accounting information (total_bw) is
> currently broken on some cases, e.g. suspend/resume on aarch64. Problem

Nit: Couldn't spot any arch dependency here. I guess it was just tested
on Arm64 platforms so far.

[...]

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 44093339761c..363ad268a25b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
>  	ndoms_cur = ndoms_new;
>  
>  	update_sched_domain_debugfs();
> +	dl_rebuild_rd_accounting();

Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
barf when called via cpuhp's:

sched_cpu_deactivate()

  cpuset_cpu_inactive()

    partition_sched_domains()

      partition_sched_domains_locked()

        dl_rebuild_rd_accounting()

?

[...]

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Waiman Long 11 months ago

On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
> On 10/03/2025 10:37, Juri Lelli wrote:
>> Rebuilding of root domains accounting information (total_bw) is
>> currently broken on some cases, e.g. suspend/resume on aarch64. Problem
> Nit: Couldn't spot any arch dependency here. I guess it was just tested
> on Arm64 platforms so far.
>
> [...]
>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 44093339761c..363ad268a25b 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
>>   	ndoms_cur = ndoms_new;
>>   
>>   	update_sched_domain_debugfs();
>> +	dl_rebuild_rd_accounting();
> Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
> barf when called via cpuhp's:
>
> sched_cpu_deactivate()
>
>    cpuset_cpu_inactive()
>
>      partition_sched_domains()
>
>        partition_sched_domains_locked()
>
>          dl_rebuild_rd_accounting()
>
> ?
>
> [...]

Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will be 
called without holding cpuset mutex.

Well, I think we will need an additional wrapper in cpuset.c that 
acquires the cpuset_mutex first before calling partition_sched_domains() 
and use the new wrapper in these cases.

Cheers,
Longman

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Waiman Long 11 months ago

On 3/10/25 3:18 PM, Waiman Long wrote:
>
> On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
>> On 10/03/2025 10:37, Juri Lelli wrote:
>>> Rebuilding of root domains accounting information (total_bw) is
>>> currently broken on some cases, e.g. suspend/resume on aarch64. Problem
>> Nit: Couldn't spot any arch dependency here. I guess it was just tested
>> on Arm64 platforms so far.
>>
>> [...]
>>
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 44093339761c..363ad268a25b 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int 
>>> ndoms_new, cpumask_var_t doms_new[],
>>>       ndoms_cur = ndoms_new;
>>>         update_sched_domain_debugfs();
>>> +    dl_rebuild_rd_accounting();
>> Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
>> barf when called via cpuhp's:
>>
>> sched_cpu_deactivate()
>>
>>    cpuset_cpu_inactive()
>>
>>      partition_sched_domains()
>>
>>        partition_sched_domains_locked()
>>
>>          dl_rebuild_rd_accounting()
>>
>> ?
>>
>> [...]
>
> Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will 
> be called without holding cpuset mutex.
>
> Well, I think we will need an additional wrapper in cpuset.c that 
> acquires the cpuset_mutex first before calling 
> partition_sched_domains() and use the new wrapper in these cases.

Actually, partition_sched_domains() is called with the special arguments 
(1, NULL, NULL) to reset the domain to a single one. So perhaps 
something like the following will be enough to avoid this problem.

BTW, we can merge partition_sched_domains_locked() into 
partition_sched_domains() as there is no other caller.

Cheers,
Longman

------------------------------------------------------------------------------------------------

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 95bde793651c..39b9ffa6a39a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2692,6 +2692,7 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
                                     struct sched_domain_attr *dattr_new)
  {
         bool __maybe_unused has_eas = false;
+       bool reset_domain = false;
         int i, j, n;
         int new_topology;

@@ -2706,6 +2707,7 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
         if (!doms_new) {
                 WARN_ON_ONCE(dattr_new);
                 n = 0;
+               reset_domain = true;
                 doms_new = alloc_sched_domains(1);
                 if (doms_new) {
                         n = 1;
@@ -2778,7 +2780,8 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
         ndoms_cur = ndoms_new;

         update_sched_domain_debugfs();
-       dl_rebuild_rd_accounting();
+       if (!reset_domain)
+               dl_rebuild_rd_accounting();
  }

  /*

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Juri Lelli 11 months ago

On 10/03/25 20:16, Waiman Long wrote:
> On 3/10/25 3:18 PM, Waiman Long wrote:
> > 
> > On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
> > > On 10/03/2025 10:37, Juri Lelli wrote:
> > > > Rebuilding of root domains accounting information (total_bw) is
> > > > currently broken on some cases, e.g. suspend/resume on aarch64. Problem
> > > Nit: Couldn't spot any arch dependency here. I guess it was just tested
> > > on Arm64 platforms so far.
> > > 
> > > [...]
> > > 
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index 44093339761c..363ad268a25b 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int
> > > > ndoms_new, cpumask_var_t doms_new[],
> > > >       ndoms_cur = ndoms_new;
> > > >         update_sched_domain_debugfs();
> > > > +    dl_rebuild_rd_accounting();
> > > Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
> > > barf when called via cpuhp's:
> > > 
> > > sched_cpu_deactivate()
> > > 
> > >    cpuset_cpu_inactive()
> > > 
> > >      partition_sched_domains()
> > > 
> > >        partition_sched_domains_locked()
> > > 
> > >          dl_rebuild_rd_accounting()
> > > 
> > > ?

Good catch. Guess I didn't notice while testing with LOCKDEP as I was
never able to hit this call path on my systems.

> > Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will be
> > called without holding cpuset mutex.
> > 
> > Well, I think we will need an additional wrapper in cpuset.c that
> > acquires the cpuset_mutex first before calling partition_sched_domains()
> > and use the new wrapper in these cases.
> 
> Actually, partition_sched_domains() is called with the special arguments (1,
> NULL, NULL) to reset the domain to a single one. So perhaps something like
> the following will be enough to avoid this problem.

I think this would work, as we will still rebuild the accounting after
the last CPU comes back from suspend. The thing I am still not sure about
is what we want to do in case we have DEADLINE tasks around, since with
this I believe we would be ignoring them and letting suspend proceed.

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Waiman Long 11 months ago

On 3/11/25 7:59 AM, Juri Lelli wrote:
> On 10/03/25 20:16, Waiman Long wrote:
>> On 3/10/25 3:18 PM, Waiman Long wrote:
>>> On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
>>>> On 10/03/2025 10:37, Juri Lelli wrote:
>>>>> Rebuilding of root domains accounting information (total_bw) is
>>>>> currently broken on some cases, e.g. suspend/resume on aarch64. Problem
>>>> Nit: Couldn't spot any arch dependency here. I guess it was just tested
>>>> on Arm64 platforms so far.
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>>> index 44093339761c..363ad268a25b 100644
>>>>> --- a/kernel/sched/topology.c
>>>>> +++ b/kernel/sched/topology.c
>>>>> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int
>>>>> ndoms_new, cpumask_var_t doms_new[],
>>>>>        ndoms_cur = ndoms_new;
>>>>>          update_sched_domain_debugfs();
>>>>> +    dl_rebuild_rd_accounting();
>>>> Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
>>>> barf when called via cpuhp's:
>>>>
>>>> sched_cpu_deactivate()
>>>>
>>>>     cpuset_cpu_inactive()
>>>>
>>>>       partition_sched_domains()
>>>>
>>>>         partition_sched_domains_locked()
>>>>
>>>>           dl_rebuild_rd_accounting()
>>>>
>>>> ?
> Good catch. Guess I didn't notice while testing with LOCKDEP as I was
> never able to hit this call path on my systems.
>
>>> Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will be
>>> called without holding cpuset mutex.
>>>
>>> Well, I think we will need an additional wrapper in cpuset.c that
>>> acquires the cpuset_mutex first before calling partition_sched_domains()
>>> and use the new wrapper in these cases.
>> Actually, partition_sched_domains() is called with the special arguments (1,
>> NULL, NULL) to reset the domain to a single one. So perhaps something like
>> the following will be enough to avoid this problem.
> I think this would work, as we will still rebuild the accounting after
> last CPU comes back from suspend. The thing I am still not sure about is
> what we want to do in case we have DEADLINE tasks around, since with
> this I belive we would be ignoring them and let suspend proceed.

That is the current behavior. You can certainly create a test case to
trigger such a condition and see what to do about it. Alternatively, you
can document that and come up with a follow-up patch later on.

Cheers,
Longman

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Dietmar Eggemann 11 months ago

On 11/03/2025 13:34, Waiman Long wrote:
> On 3/11/25 7:59 AM, Juri Lelli wrote:
>> On 10/03/25 20:16, Waiman Long wrote:
>>> On 3/10/25 3:18 PM, Waiman Long wrote:
>>>> On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
>>>>> On 10/03/2025 10:37, Juri Lelli wrote:
>>>>>> Rebuilding of root domains accounting information (total_bw) is
>>>>>> currently broken on some cases, e.g. suspend/resume on aarch64.
>>>>>> Problem
>>>>> Nit: Couldn't spot any arch dependency here. I guess it was just
>>>>> tested
>>>>> on Arm64 platforms so far.
>>>>>
>>>>> [...]
>>>>>
>>>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>>>> index 44093339761c..363ad268a25b 100644
>>>>>> --- a/kernel/sched/topology.c
>>>>>> +++ b/kernel/sched/topology.c
>>>>>> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int
>>>>>> ndoms_new, cpumask_var_t doms_new[],
>>>>>>        ndoms_cur = ndoms_new;
>>>>>>          update_sched_domain_debugfs();
>>>>>> +    dl_rebuild_rd_accounting();
>>>>> Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
>>>>> barf when called via cpuhp's:
>>>>>
>>>>> sched_cpu_deactivate()
>>>>>
>>>>>     cpuset_cpu_inactive()
>>>>>
>>>>>       partition_sched_domains()
>>>>>
>>>>>         partition_sched_domains_locked()
>>>>>
>>>>>           dl_rebuild_rd_accounting()
>>>>>
>>>>> ?
>> Good catch. Guess I didn't notice while testing with LOCKDEP as I was
>> never able to hit this call path on my systems.
>>
>>>> Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will be
>>>> called without holding cpuset mutex.
>>>>
>>>> Well, I think we will need an additional wrapper in cpuset.c that
>>>> acquires the cpuset_mutex first before calling
>>>> partition_sched_domains()
>>>> and use the new wrapper in these cases.
>>> Actually, partition_sched_domains() is called with the special
>>> arguments (1,
>>> NULL, NULL) to reset the domain to a single one. So perhaps something
>>> like
>>> the following will be enough to avoid this problem.
>> I think this would work, as we will still rebuild the accounting after
>> last CPU comes back from suspend. The thing I am still not sure about is
>> what we want to do in case we have DEADLINE tasks around, since with
>> this I belive we would be ignoring them and let suspend proceed.
> 
> That is the current behavior. You can certainly create a test case to
> trigger such condition and see what to do about it. Alternatively, you
> can document that and come up with a follow-up patch later on.

But don't we rely on partition_sched_domains_locked() calling
dl_rebuild_rd_accounting() even in the reset_domain=1 case?

Testcase: suspend/resume

on Arm64 big.LITTLE cpumask=[LITTLE][big]=[0,3-5][1-2]
plus cmd line option 'isolcpus=3,4'.

with Waiman's snippet:
https://lkml.kernel.org/r/fd4d6143-9bd2-4a7c-80dc-1e19e4d1b2d1@redhat.com

...
[  234.831675] --- > partition_sched_domains_locked() reset_domain=1
[  234.835966] psci: CPU4 killed (polled 0 ms)
[  234.838912] Error taking CPU3 down: -16
[  234.838952] Non-boot CPUs are not disabled
[  234.838986] Enabling non-boot CPUs ...
...

IIRC, that's the old DL accounting issue.
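
For readers decoding the log above: -16 is -EBUSY. The following toy model is only a sketch of why stale accounting could produce that error. It assumes, based on the subject of the commit named in the Fixes: tag, that CPU hotplug refuses to take a CPU down when the remaining CPUs could not carry the bandwidth still accounted in total_bw. The toy_* names, the per-CPU limit and the numbers are all made up for illustration and do not mirror the kernel's actual check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Arbitrary units of admissible bandwidth per CPU (hypothetical). */
#define TOY_BW_PER_CPU 100

/* Toy stand-in for a root domain: accounted bandwidth and CPU count. */
struct toy_rd {
	uint64_t total_bw;
	unsigned int nr_cpus;
};

/*
 * Try to take one CPU out of the domain. Refuse with -EBUSY if the
 * bandwidth currently accounted would no longer fit on the CPUs left.
 */
int toy_cpu_down(struct toy_rd *rd)
{
	uint64_t cap_after;

	assert(rd->nr_cpus > 0);
	cap_after = (uint64_t)(rd->nr_cpus - 1) * TOY_BW_PER_CPU;

	if (rd->total_bw > cap_after)
		return -EBUSY;

	rd->nr_cpus--;
	return 0;
}
```

In this model a domain with two CPUs and a stale total_bw of 150 rejects the offline request, while the same domain with total_bw rebuilt to the real value of, say, 50 lets it through. Skipping the rebuild in the reset_domain case leaves the stale value in place, which matches the symptom reported above.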

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Waiman Long 11 months ago

On 3/11/25 9:29 AM, Dietmar Eggemann wrote:
> On 11/03/2025 13:34, Waiman Long wrote:
>> On 3/11/25 7:59 AM, Juri Lelli wrote:
>>> On 10/03/25 20:16, Waiman Long wrote:
>>>> On 3/10/25 3:18 PM, Waiman Long wrote:
>>>>> On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
>>>>>> On 10/03/2025 10:37, Juri Lelli wrote:
>>>>>>> Rebuilding of root domains accounting information (total_bw) is
>>>>>>> currently broken on some cases, e.g. suspend/resume on aarch64.
>>>>>>> Problem
>>>>>> Nit: Couldn't spot any arch dependency here. I guess it was just
>>>>>> tested
>>>>>> on Arm64 platforms so far.
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>>>>> index 44093339761c..363ad268a25b 100644
>>>>>>> --- a/kernel/sched/topology.c
>>>>>>> +++ b/kernel/sched/topology.c
>>>>>>> @@ -2791,6 +2791,7 @@ void partition_sched_domains_locked(int
>>>>>>> ndoms_new, cpumask_var_t doms_new[],
>>>>>>>         ndoms_cur = ndoms_new;
>>>>>>>           update_sched_domain_debugfs();
>>>>>>> +    dl_rebuild_rd_accounting();
>>>>>> Won't dl_rebuild_rd_accounting()'s lockdep_assert_held(&cpuset_mutex)
>>>>>> barf when called via cpuhp's:
>>>>>>
>>>>>> sched_cpu_deactivate()
>>>>>>
>>>>>>      cpuset_cpu_inactive()
>>>>>>
>>>>>>        partition_sched_domains()
>>>>>>
>>>>>>          partition_sched_domains_locked()
>>>>>>
>>>>>>            dl_rebuild_rd_accounting()
>>>>>>
>>>>>> ?
>>> Good catch. Guess I didn't notice while testing with LOCKDEP as I was
>>> never able to hit this call path on my systems.
>>>
>>>>> Right. If cpuhp_tasks_frozen is true, partition_sched_domains() will be
>>>>> called without holding cpuset mutex.
>>>>>
>>>>> Well, I think we will need an additional wrapper in cpuset.c that
>>>>> acquires the cpuset_mutex first before calling
>>>>> partition_sched_domains()
>>>>> and use the new wrapper in these cases.
>>>> Actually, partition_sched_domains() is called with the special
>>>> arguments (1,
>>>> NULL, NULL) to reset the domain to a single one. So perhaps something
>>>> like
>>>> the following will be enough to avoid this problem.
>>> I think this would work, as we will still rebuild the accounting after
>>> last CPU comes back from suspend. The thing I am still not sure about is
>>> what we want to do in case we have DEADLINE tasks around, since with
>>> this I belive we would be ignoring them and let suspend proceed.
>> That is the current behavior. You can certainly create a test case to
>> trigger such condition and see what to do about it. Alternatively, you
>> can document that and come up with a follow-up patch later on.
> But don't we rely on that partition_sched_domains_locked() calls
> dl_rebuild_rd_accounting() even in the reset_domain=1 case?
>
> Testcase: suspend/resume
>
> on Arm64 big.LITTLE cpumask=[LITTLE][big]=[0,3-5][1-2]
> plus cmd line option 'isolcpus=3,4'.
>
> with Waiman's snippet:
> https://lkml.kernel.org/r/fd4d6143-9bd2-4a7c-80dc-1e19e4d1b2d1@redhat.com
>
> ...
> [  234.831675] --- > partition_sched_domains_locked() reset_domain=1
> [  234.835966] psci: CPU4 killed (polled 0 ms)
> [  234.838912] Error taking CPU3 down: -16
> [  234.838952] Non-boot CPUs are not disabled
> [  234.838986] Enabling non-boot CPUs ...
> ...
>
> IIRC, that's the old DL accounting issue.

You are right. cpuhp_tasks_frozen will be set in the suspend/resume
case. In that case, we do need to add a cpuset helper to acquire the
cpuset_mutex. A test patch follows (no testing done yet):

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index c414daa7d503..ef1ffb9c52b0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -129,6 +129,7 @@ extern void dl_rebuild_rd_accounting(void);
  extern void rebuild_sched_domains(void);

  extern void cpuset_print_current_mems_allowed(void);
+extern void cpuset_reset_sched_domains(void);

  /*
   * read_mems_allowed_begin is required when making decisions involving
@@ -269,6 +270,11 @@ static inline void rebuild_sched_domains(void)
         partition_sched_domains(1, NULL, NULL);
  }

+static inline void cpuset_reset_sched_domains(void)
+{
+       partition_sched_domains(1, NULL, NULL);
+}
+
  static inline void cpuset_print_current_mems_allowed(void)
  {
  }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7995cd58a01b..a51099e5d587 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1076,6 +1076,13 @@ void rebuild_sched_domains(void)
         cpus_read_unlock();
  }

+void cpuset_reset_sched_domains(void)
+{
+       mutex_lock(&cpuset_mutex);
+       partition_sched_domains(1, NULL, NULL);
+       mutex_unlock(&cpuset_mutex);
+}
+
  /**
   * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
   * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 58593f4d09a1..dbf44ddbb6b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8183,7 +8183,7 @@ static void cpuset_cpu_active(void)
                  * operation in the resume sequence, just build a single sched
                  * domain, ignoring cpusets.
                  */
-               partition_sched_domains(1, NULL, NULL);
+               cpuset_reset_sched_domains();
                 if (--num_cpus_frozen)
                         return;
                 /*
@@ -8202,7 +8202,7 @@ static void cpuset_cpu_inactive(unsigned int cpu)
                 cpuset_update_active_cpus();
         } else {
                 num_cpus_frozen++;
-               partition_sched_domains(1, NULL, NULL);
+               cpuset_reset_sched_domains();
         }
  }

Cheers,
Longman
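
The locking pattern in the test patch above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: the mutex is a plain flag, the lockdep assertion is replaced by a counter, and every toy_* name is hypothetical. It shows the bare call path tripping the assertion while the wrapper path does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for cpuset_mutex; a flag is enough to show the idea. */
struct toy_mutex {
	bool held;
};

struct toy_mutex toy_cpuset_mutex;
unsigned int toy_assert_failures;	/* how often the "lockdep" check tripped */
unsigned int toy_rebuilds;		/* how often accounting was rebuilt */

void toy_mutex_lock(struct toy_mutex *m)
{
	assert(!m->held);
	m->held = true;
}

void toy_mutex_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

/* Models dl_rebuild_rd_accounting(): warn (count) if the lock is missing. */
void toy_rebuild_rd_accounting(void)
{
	if (!toy_cpuset_mutex.held)
		toy_assert_failures++;

	toy_rebuilds++;
}

/* Models partition_sched_domains(1, NULL, NULL) as called from hotplug. */
void toy_partition_sched_domains(void)
{
	toy_rebuild_rd_accounting();
}

/* Models the proposed wrapper: take the lock around the reset. */
void toy_cpuset_reset_sched_domains(void)
{
	toy_mutex_lock(&toy_cpuset_mutex);
	toy_partition_sched_domains();
	toy_mutex_unlock(&toy_cpuset_mutex);
}
```

Calling toy_partition_sched_domains() directly bumps toy_assert_failures, which corresponds to the lockdep complaint raised earlier in the thread. Calling toy_cpuset_reset_sched_domains() rebuilds the accounting without tripping the check and leaves the lock released afterwards.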


Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Dietmar Eggemann 11 months ago

On 11/03/2025 15:51, Waiman Long wrote:
> On 3/11/25 9:29 AM, Dietmar Eggemann wrote:
>> On 11/03/2025 13:34, Waiman Long wrote:
>>> On 3/11/25 7:59 AM, Juri Lelli wrote:
>>>> On 10/03/25 20:16, Waiman Long wrote:
>>>>> On 3/10/25 3:18 PM, Waiman Long wrote:
>>>>>> On 3/10/25 2:54 PM, Dietmar Eggemann wrote:
>>>>>>> On 10/03/2025 10:37, Juri Lelli wrote:

[...]

>> Testcase: suspend/resume
>>
>> on Arm64 big.LITTLE cpumask=[LITTLE][big]=[0,3-5][1-2]
>> plus cmd line option 'isolcpus=3,4'.
>>
>> with Waiman's snippet:
>> https://lkml.kernel.org/r/fd4d6143-9bd2-4a7c-80dc-1e19e4d1b2d1@redhat.com
>>
>> ...
>> [  234.831675] --- > partition_sched_domains_locked() reset_domain=1
>> [  234.835966] psci: CPU4 killed (polled 0 ms)
>> [  234.838912] Error taking CPU3 down: -16
>> [  234.838952] Non-boot CPUs are not disabled
>> [  234.838986] Enabling non-boot CPUs ...
>> ...
>>
>> IIRC, that's the old DL accounting issue.
> 
> You are right. cpuhp_tasks_frozen will be set in the suspend/resume
> case. In that case, we do need to add a cpuset helper to acquire the
> cpuset_mutex. A test patch as follows (no testing done yet):
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index c414daa7d503..ef1ffb9c52b0 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -129,6 +129,7 @@ extern void dl_rebuild_rd_accounting(void);
>  extern void rebuild_sched_domains(void);
> 
>  extern void cpuset_print_current_mems_allowed(void);
> +extern void cpuset_reset_sched_domains(void)
> 
>  /*
>   * read_mems_allowed_begin is required when making decisions involving
> @@ -269,6 +270,11 @@ static inline void rebuild_sched_domains(void)
>         partition_sched_domains(1, NULL, NULL);
>  }
> 
> +static inline void cpuset_reset_sched_domains(void)
> +{
> +       partition_sched_domains(1, NULL, NULL);
> +}
> +
>  static inline void cpuset_print_current_mems_allowed(void)
>  {
>  }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 7995cd58a01b..a51099e5d587 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1076,6 +1076,13 @@ void rebuild_sched_domains(void)
>         cpus_read_unlock();
>  }
> 
> +void cpuset_reset_sched_domains(void)
> +{
> +       mutex_lock(&cpuset_mutex);
> +       partition_sched_domains(1, NULL, NULL);
> +       mutex_unlock(&cpuset_mutex);
> +}
> +
>  /**
>   * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the
> cpuset.
>   * @cs: the cpuset in which each task's cpus_allowed mask needs to be
> changed
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 58593f4d09a1..dbf44ddbb6b4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8183,7 +8183,7 @@ static void cpuset_cpu_active(void)
>                  * operation in the resume sequence, just build a single
> sched
>                  * domain, ignoring cpusets.
>                  */
> -               partition_sched_domains(1, NULL, NULL);
> +               cpuset_reset_sched_domains();
>                 if (--num_cpus_frozen)
>                         return;
>                 /*
> @@ -8202,7 +8202,7 @@ static void cpuset_cpu_inactive(unsigned int cpu)
>                 cpuset_update_active_cpus();
>         } else {
>                 num_cpus_frozen++;
> -               partition_sched_domains(1, NULL, NULL);
> +               cpuset_reset_sched_domains();
>         }
>  }

This seems to work. But what about a !CONFIG_CPUSETS build? In this case
we won't have this DL accounting update during suspend/resume since
dl_rebuild_rd_accounting() is empty.

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Juri Lelli 11 months ago

On 12/03/25 10:53, Dietmar Eggemann wrote:
> On 11/03/2025 15:51, Waiman Long wrote:

...

> > You are right. cpuhp_tasks_frozen will be set in the suspend/resume
> > case. In that case, we do need to add a cpuset helper to acquire the
> > cpuset_mutex. A test patch as follows (no testing done yet):

...

> This seems to work.

Thanks for testing!

Waiman, how would you like to proceed? A separate patch (in which case
please send me that with changelog etc.), or should I incorporate your
changes into my original patch and possibly, if you like, add
Co-authored-by?

> But what about a !CONFIG_CPUSETS build. In this case we won't have
> this DL accounting update during suspend/resume since
> dl_rebuild_rd_accounting() is empty.

I unfortunately very much suspect !CPUSETS accounting is broken. But if
that is indeed the case, it has been broken for a while. :(

Will need to double check that, but I would probably do it later on,
separately from this set, which at least seems to cure the most common
cases. What do people think?

Thanks,
Juri

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Waiman Long 11 months ago

On 3/12/25 6:09 AM, Juri Lelli wrote:
> On 12/03/25 10:53, Dietmar Eggemann wrote:
>> On 11/03/2025 15:51, Waiman Long wrote:
> ...
>
>>> You are right. cpuhp_tasks_frozen will be set in the suspend/resume
>>> case. In that case, we do need to add a cpuset helper to acquire the
>>> cpuset_mutex. A test patch as follows (no testing done yet):
> ...
>
>> This seems to work.
> Thanks for testing!
>
> Waiman, how do you like to proceed. Separate patch (in this case can you
> please send me that with changelog etc.) or incorporate your changes
> into my original patch and possibly, if you like, add Co-authored-by?
I think it will be better to merge into a single patch to avoid having a 
broken patch. It is up to you if you want me as a co-author. I don't 
really mind.
>
>> But what about a !CONFIG_CPUSETS build. In this case we won't have
>> this DL accounting update during suspend/resume since
>> dl_rebuild_rd_accounting() is empty.
> I unfortunately very much suspect !CPUSETS accounting is broken. But if
> that is indeed the case, it has been broken for a while. :(
Without CONFIG_CPUSETS, there will be one and only one global sched 
domain. Will this still be a problem?
>
> Will need to double check that, but I would probably do it later on
> separated from this set that at least seems to cure the most common
> cases. What do people think?

I am not aware of any distros that do not set CONFIG_CPUSETS. So it is
mostly a theoretical problem if there is one. So I would recommend going
ahead with the current patch series instead of spending more time
investigating this issue.

Cheers,
Longman

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Juri Lelli 11 months ago

On 12/03/25 09:55, Waiman Long wrote:
> On 3/12/25 6:09 AM, Juri Lelli wrote:
> > On 12/03/25 10:53, Dietmar Eggemann wrote:
> > > On 11/03/2025 15:51, Waiman Long wrote:
> > ...
> > 
> > > > You are right. cpuhp_tasks_frozen will be set in the suspend/resume
> > > > case. In that case, we do need to add a cpuset helper to acquire the
> > > > cpuset_mutex. A test patch as follows (no testing done yet):
> > ...
> > 
> > > This seems to work.
> > Thanks for testing!
> > 
> > Waiman, how do you like to proceed. Separate patch (in this case can you
> > please send me that with changelog etc.) or incorporate your changes
> > into my original patch and possibly, if you like, add Co-authored-by?
> I think it will be better to merge into a single patch to avoid having a
> broken patch. It is up to you if you want me as a co-author. I don't really
> mind.
> > 
> > > But what about a !CONFIG_CPUSETS build. In this case we won't have
> > > this DL accounting update during suspend/resume since
> > > dl_rebuild_rd_accounting() is empty.
> > I unfortunately very much suspect !CPUSETS accounting is broken. But if
> > that is indeed the case, it has been broken for a while. :(
> Without CONFIG_CPUSETS, there will be one and only one global sched domain.
> Will this still be a problem?

Still need to double check. But I have a feeling we don't restore
accounting correctly (at all?!) without CPUSETS. Orthogonal to this
issue though, as if we don't, we didn't so far. :/

> > Will need to double check that, but I would probably do it later on
> > separated from this set that at least seems to cure the most common
> > cases. What do people think?
> 
> I am not aware of any distros without setting CONFIG_CPUSETS. So it is
> mostly a theoretical problem if there is one. So I would recommend going
> ahead with the current patch series instead of spending more time
> investigating this issue.

And I would agree (and then find time to look more closely into the
!CPUSETS case). If nobody objects, I will refresh the series including
Waiman's changes and repost.

Thanks!
Juri

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Dietmar Eggemann 11 months ago

On 12/03/2025 15:11, Juri Lelli wrote:
> On 12/03/25 09:55, Waiman Long wrote:
>> On 3/12/25 6:09 AM, Juri Lelli wrote:
>>> On 12/03/25 10:53, Dietmar Eggemann wrote:
>>>> On 11/03/2025 15:51, Waiman Long wrote:

[...]

>>> I unfortunately very much suspect !CPUSETS accounting is broken. But if
>>> that is indeed the case, it has been broken for a while. :(
>> Without CONFIG_CPUSETS, there will be one and only one global sched domain.
>> Will this still be a problem?
> 
> Still need to double check. But I have a feeling we don't restore
> accounting correctly (at all?!) without CPUSETS. Orthogonal to this
> issue though, as if we don't, we didn't so far. :/

As expected:

Since dl_rebuild_rd_accounting() is empty with !CONFIG_CPUSETS, the same
issue happens.
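
A minimal userspace sketch of why an empty rebuild hook loses the
accounting, assuming the clear-then-restore flow described in the patch
changelog. All model_* names are hypothetical and are not kernel symbols;
the real code walks cpusets and root domains rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a root domain: only the bandwidth sum matters here. */
struct model_rd {
	uint64_t total_bw;	/* sum of admitted deadline bandwidth */
};

/* Hypothetical model of a deadline task. */
struct model_task {
	uint64_t dl_bw;		/* bandwidth reserved by this task */
};

typedef void (*model_restore_fn)(struct model_rd *rd,
				 const struct model_task *tasks, size_t n);

/* Restore step as it would look when the rebuild hook does real work. */
void model_restore_full(struct model_rd *rd,
			const struct model_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		rd->total_bw += tasks[i].dl_bw;
}

/* Restore step modelled as an empty stub: nothing is added back. */
void model_restore_stub(struct model_rd *rd,
			const struct model_task *tasks, size_t n)
{
	(void)rd;
	(void)tasks;
	(void)n;
}

/* Clear the accounting, run the restore step, return the resulting sum. */
uint64_t model_rebuild(struct model_rd *rd, const struct model_task *tasks,
		       size_t n, model_restore_fn restore)
{
	assert(rd != NULL);
	rd->total_bw = 0;
	restore(rd, tasks, n);
	return rd->total_bw;
}
```

With two tasks reserving 100 and 250 units, model_rebuild() returns 350
with model_restore_full and 0 with model_restore_stub: the tasks still
exist, but the domain no longer accounts for them.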

Testcase: suspend/resume

Test machine: Arm64 big.LITTLE cpumask=[LITTLE][big]=[0,3-5][1-2]
plus cmd line option 'isolcpus=3,4'.

...

[ 2250.898771] PM: suspend entry (deep)
[ 2250.902566] Filesystems sync: 0.000 seconds
[ 2250.908704] Freezing user space processes
[ 2250.914379] Freezing user space processes completed (elapsed 0.001 seconds)
[ 2250.921433] OOM killer disabled.
[ 2250.924702] Freezing remaining freezable tasks
[ 2250.930497] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
...
[ 2251.060052] Disabling non-boot CPUs ...
[ 2251.060426] CPU0 attaching NULL sched-domain.
[ 2251.060455] CPU1 attaching NULL sched-domain.
[ 2251.060478] CPU2 attaching NULL sched-domain.
[ 2251.060499] CPU5 attaching NULL sched-domain.
[ 2251.060712] CPU0 attaching sched-domain(s):
[ 2251.060723]  domain-0: span=0-2 level=PKG
[ 2251.060750]   groups: 0:{ span=0 cap=503 }, 1:{ span=1-2 cap=2048 }
[ 2251.060829] CPU1 attaching sched-domain(s):
[ 2251.060838]  domain-0: span=1-2 level=MC
[ 2251.060859]   groups: 1:{ span=1 }, 2:{ span=2 }
[ 2251.060906]   domain-1: span=0-2 level=PKG
[ 2251.060926]    groups: 1:{ span=1-2 cap=2048 }, 0:{ span=0 cap=503 }
[ 2251.061000] CPU2 attaching sched-domain(s):
[ 2251.061009]  domain-0: span=1-2 level=MC
[ 2251.061030]   groups: 2:{ span=2 }, 1:{ span=1 }
[ 2251.061077]   domain-1: span=0-2 level=PKG
[ 2251.061097]    groups: 1:{ span=1-2 cap=2048 }, 0:{ span=0 cap=503 }
[ 2251.061221] root domain span: 0-2
[ 2251.061270] root_domain 0-2: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{ cpus=0,3-5 nr_pstate=5 }
[ 2251.064976] psci: CPU5 killed (polled 0 ms)
[ 2251.066211] Error taking CPU4 down: -16
[ 2251.066226] Non-boot CPUs are not disabled
[ 2251.066234] Enabling non-boot CPUs ...
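
For readers decoding the log above: the -16 in "Error taking CPU4 down"
is a negated errno value, and errno 16 is EBUSY on Linux. A small sketch
of that mapping follows; the helper names are illustrative and are not
kernel functions.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Kernel log lines print negated errno values, so "-16" means errno 16. */
int log_code_to_errno(int log_code)
{
	return log_code < 0 ? -log_code : log_code;
}

/* Format "<code> -> <description>" into buf using the C library's text. */
void describe_log_code(int log_code, char *buf, size_t len)
{
	snprintf(buf, len, "%d -> %s", log_code,
		 strerror(log_code_to_errno(log_code)));
}
```

Calling describe_log_code(-16, buf, sizeof(buf)) fills buf with the code
followed by the C library's description of EBUSY.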

[...]

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Juri Lelli 11 months ago

On 12/03/25 17:29, Dietmar Eggemann wrote:
> On 12/03/2025 15:11, Juri Lelli wrote:
> > On 12/03/25 09:55, Waiman Long wrote:
> >> On 3/12/25 6:09 AM, Juri Lelli wrote:
> >>> On 12/03/25 10:53, Dietmar Eggemann wrote:
> >>>> On 11/03/2025 15:51, Waiman Long wrote:
> 
> [...]
> 
> >>> I unfortunately very much suspect !CPUSETS accounting is broken. But if
> >>> that is indeed the case, it has been broken for a while. :(
> >> Without CONFIG_CPUSETS, there will be one and only one global sched domain.
> >> Will this still be a problem?
> > 
> > Still need to double check. But I have a feeling we don't restore
> > accounting correctly (at all?!) without CPUSETS. Orthogonal to this
> > issue though, as if we don't, we didn't so far. :/
> 
> As expected:
> 
> Since dl_rebuild_rd_accounting() is empty with !CONFIG_CPUSETS, the same
> issue happens.

Right, suspicion confirmed. :)

But, as I was saying, I believe it has been broken for a while/forever.
Not only suspend/resume, the accounting itself.

Would you be OK if we address the !CPUSETS case with a separate later
series?

Thanks!
Juri

Re: [PATCH v3 4/8] sched/deadline: Rebuild root domain accounting after every update

Posted by Dietmar Eggemann 11 months ago

On 12.03.25 17:51, Juri Lelli wrote:
> On 12/03/25 17:29, Dietmar Eggemann wrote:
>> On 12/03/2025 15:11, Juri Lelli wrote:
>>> On 12/03/25 09:55, Waiman Long wrote:
>>>> On 3/12/25 6:09 AM, Juri Lelli wrote:
>>>>> On 12/03/25 10:53, Dietmar Eggemann wrote:
>>>>>> On 11/03/2025 15:51, Waiman Long wrote:
>>
>> [...]
>>
>>>>> I unfortunately very much suspect !CPUSETS accounting is broken. But if
>>>>> that is indeed the case, it has been broken for a while. :(
>>>> Without CONFIG_CPUSETS, there will be one and only one global sched domain.
>>>> Will this still be a problem?
>>>
>>> Still need to double check. But I have a feeling we don't restore
>>> accounting correctly (at all?!) without CPUSETS. Orthogonal to this
>>> issue though, as if we don't, we didn't so far. :/
>>
>> As expected:
>>
>> Since dl_rebuild_rd_accounting() is empty with !CONFIG_CPUSETS, the same
>> issue happens.
> 
> Right, suspicion confirmed. :)
> 
> But, as I was saying, I believe it has been broken for a while/forever.
> Not only suspend/resume, the accounting itself.
> 
> Would you be OK if we address the !CPUSETS case with a separate later
> series?

Yes, we can do that.