There is a hazard in the deadline scheduler where an offlined CPU
can have its free_cpus bit left set in the def_root_domain when
the schedutil cpufreq governor is used. This can allow a deadline
thread to be pushed to the runqueue of a powered down CPU which
breaks scheduling.
This commit works around the issue by only setting the free_cpus
bit for a CPU when it is "active". It is likely that the ordering
of sched_set_rq_online() and set_cpu_active() at the end of the
sched_cpu_deactivate() function should be revisited if this
approach has merit.
Signed-off-by: Doug Berger <opendmb@gmail.com>
---
Coffee is recommended before proceeding.
While stress testing CPU hotplug on a quad-core arm64 system I
encountered a deadlock. My specific deadlock appears to depend on
the system having three or more cores and on using the schedutil
cpufreq governor, which uses a deadline scheduled thread named
"sugov:n" where n is the CPU number.
The scenario I observe is as follows:
Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
previously offlined so their runqueues are attached to the
def_root_domain.
1) A hotplug operation is initiated on CPU2.
2) The cpuhp/2 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step.
3) The schedutil cpufreq governor creates the "sugov:2" thread to
execute on CPU2 with the deadline scheduler.
4) The deadline scheduler clears the free_cpus bit for CPU2 within
the def_root_domain when "sugov:2" is scheduled.
5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
clear the deadline, which sets the free_cpus bit for CPU2 within
the def_root_domain.
6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
domain is created to include CPU0, CPU1, and CPU2 (see the
rq_attach_root() sketch after this list for the mechanics at work
here and in steps 11 and 17).
o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
CPU1 which offlines their runqueues and detaches their current
dynamic scheduling domain (clearing their deadline free_cpus
bits there) and attaches the def_root_domain and onlines their
runqueues (setting their deadline free_cpus bits there).
o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
and CPU2.
- Since only CPU0 and CPU1 are online in the def_root_domain
set_rq_offline() is only called for them to offline their
runqueues and detach the def_root_domain (clearing their
deadline free_cpus bits there).
- The free_cpus bit for CPU2 in def_root_domain is allowed to
remain set.
- The newly created dynamic scheduling domain is attached to
CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
to online their runqueues (setting their deadline free_cpus
bits there).
7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
CPUHP_AP_ACTIVE step, but since the runqueues are already online
essentially nothing happens.
8) Some time later CPU2 is hot unplugged.
9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
invokes balance_push_set() for CPU2 which migrates "sugov:2" to
a different CPU through fallback.
10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
CPU2 which takes its runqueue offline and clears its deadline
free_cpus bit in the current dynamic scheduling domain.
11) Also at this step, cpuhp/2 updates the scheduling domain to
remove CPU2.
o detach_destroy_domains() invokes rq_attach_root() for CPU0,
CPU1, and CPU2 to move them back to the def_root_domain.
- Since only CPU0 and CPU1 are online in the current dynamic
scheduling domain (CPU2 was removed at 10 above),
set_rq_offline() is only called for them to clear their
deadline free_cpus bits.
- The def_root_domain is attached to CPU0, CPU1, and CPU2
runqueues and since only CPU0 and CPU1 are marked active
set_rq_online() is used to online their runqueues (setting
their deadline free_cpus bits there).
- The free_cpus bit for CPU2 in def_root_domain is allowed
to remain set.
o build_sched_domains() invokes rq_attach_root() for CPU0 and
CPU1 which offlines their runqueues (clearing their deadline
free_cpus bits in def_root_domain), attaches a new dynamic
scheduling domain, and onlines their runqueues (setting their
deadline free_cpus bits there).
12) The cpuhp/2 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step which attempts to stop the
"sugov:2" kthread by calling kthread_flush_worker() followed by
kthread_stop().
13) The "sugov:2" thread can likely be successfully deadline
scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
complete offlining CPU2 and power it off.
14) Some time later CPU1 is hot unplugged.
15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
invokes balance_push_set() for CPU1 which migrates "sugov:1" to
CPU0 through fallback.
16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
CPU1 which takes its runqueue offline and clears its deadline
free_cpus bit in the current dynamic scheduling domain.
17) Also at this step, cpuhp/1 updates the scheduling domain to
remove CPU1.
o detach_destroy_domains() invokes rq_attach_root() for CPU0
and CPU1 to move them back to the def_root_domain.
- Since only CPU0 is online in the current dynamic scheduling
domain (CPU1 was removed at 16 above), set_rq_offline() is
only called for it to clear its deadline free_cpus bit.
- The def_root_domain is attached to CPU0 and CPU1 runqueues
and since only CPU0 is marked active set_rq_online() is
used to online its runqueue (setting its deadline free_cpus
bit there).
- The free_cpus bit for CPU1 is untouched in def_root_domain.
- The free_cpus bit for CPU2 in def_root_domain remains set
from the preceding sequence.
18) If CPU0 executes the "sugov:0" deadline thread at this time it
may see that the "sugov:1" deadline thread is also on its
runqueue and may call push_dl_task() to attempt to push it to a
different CPU.
19) The effort to find a later runqueue will find the stale
free_cpus bit of CPU2 in the currently attached def_root_domain
and will migrate the "sugov:1" thread to the runqueue of the
powered down CPU2 where it can never get scheduled.
20) The cpuhp/1 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step which attempts to stop the
"sugov:1" kthread by calling kthread_flush_worker() followed by
kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
remains blocked on completion events.
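For reference, steps 6, 11, and 17 all come down to the same two
conditionals in rq_attach_root(). The following is a paraphrased
sketch (abridged from kernel/sched/topology.c, with the root domain
refcounting and RCU freeing elided) showing why the free_cpus bit of
an inactive CPU is never touched:

void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
	struct root_domain *old_rd = NULL;
	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);

	if (rq->rd) {
		old_rd = rq->rd;

		/*
		 * Only a runqueue marked online in the old root
		 * domain is taken offline here, so detaching an
		 * inactive CPU2 never clears its free_cpus bit.
		 */
		if (cpumask_test_cpu(rq->cpu, old_rd->online))
			set_rq_offline(rq);

		cpumask_clear_cpu(rq->cpu, old_rd->span);
	}

	rq->rd = rd;
	cpumask_set_cpu(rq->cpu, rd->span);

	/*
	 * Only an active CPU is brought online in the new root
	 * domain, so inactive CPU2 is attached but its runqueue
	 * stays offline and any free_cpus bit it already has there
	 * is left as is.
	 */
	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
		set_rq_online(rq);

	rq_unlock_irqrestore(rq, &rf);
}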
Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
the deadline scheduler def_root_domain to remain set for a CPU that
is powered off. The trap can be sprung during the narrow timing
hazard when the def_root_domain is transitionally attached while
changing scheduling domains if the deadline scheduler pushes a
queued task to the powered off CPU.
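The push path consumes the stale bit through cpudl_find(), which
trusts free_cpus unconditionally. A paraphrased sketch (abridged from
kernel/sched/cpudeadline.c; the capacity-aware selection and the
heap-maximum fallback are elided):

int cpudl_find(struct cpudl *cp, struct task_struct *p,
	       struct cpumask *later_mask)
{
	/*
	 * Any CPU that is both "free" (no deadline task queued) and
	 * allowed by the task's affinity is considered a valid later
	 * runqueue. A stale free_cpus bit for the powered down CPU2
	 * survives this AND and is returned as a push target.
	 */
	if (later_mask &&
	    cpumask_and(later_mask, cp->free_cpus, &p->cpus_mask))
		return 1;

	/*
	 * Otherwise compare against the CPU holding the latest
	 * deadline (the heap maximum); elided here.
	 */
	return 0;
}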
This problem appears to have been initially introduced by commit
120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
moved the set_rq_offline() handling from sched_cpu_dying() to
sched_cpu_deactivate(). The original sequence allowed the free_cpus
bit to be forcibly cleared in the def_root_domain after all of the
scheduler dust settled. The new location makes sched_set_rq_offline()
essentially meaningless for the deadline scheduler since the
management of scheduling-domain changes happens later.
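For reference, the resulting ordering looks roughly as follows
(paraphrased and heavily abridged from kernel/sched/core.c; the nohz,
NUMA, and failure rollback handling are all elided):

int sched_cpu_deactivate(unsigned int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	set_cpu_active(cpu, false);	/* steps 9/15 */
	balance_push_set(cpu, true);	/* steps 9/15 */
	synchronize_rcu();

	/*
	 * Steps 10/16: this clears the free_cpus bit, but only in
	 * the currently attached dynamic root domain.
	 */
	sched_set_rq_offline(rq, cpu);

	/*
	 * Steps 11/17: the scheduling domain update, and with it the
	 * transitional attach of the def_root_domain, is only
	 * reached below, after the clear above has already happened.
	 */
	return cpuset_cpu_inactive(cpu);
}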
There are likely many different approaches to address this issue
and I'm hopeful that someone more familiar with the scheduler than
I am can propose a better solution than the one suggested here.
Thank you for reading this far. Any advice is appreciated.
-Doug
kernel/sched/cpudeadline.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6896bbe0e9ae 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
 		cp->elements[cpu].idx = IDX_INVALID;
 		cpudl_heapify(cp, old_idx);
 
-		cpumask_set_cpu(cpu, cp->free_cpus);
+		if (cpu_active(cpu))
+			cpumask_set_cpu(cpu, cp->free_cpus);
 	}
 	raw_spin_unlock_irqrestore(&cp->lock, flags);
 }
--
2.34.1
Hi Doug,
On 10/01/25 15:30, Doug Berger wrote:
> There is a hazard in the deadline scheduler where an offlined CPU
> can have its free_cpus bit left set in the def_root_domain when
> the schedutil cpufreq governor is used. This can allow a deadline
> thread to be pushed to the runqueue of a powered down CPU which
> breaks scheduling.
>
> This commit works around the issue by only setting the free_cpus
> bit for a CPU when it is "active". It is likely that the ordering
> of sched_set_rq_online() and set_cpu_active() at the end of the
> sched_cpu_deactivate() function should be revisited if this
> approach has merit.
>
> Signed-off-by: Doug Berger <opendmb@gmail.com>
> ---
<snip>
> This problem appears to have been initially introduced by commit
> 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
> moved the set_rq_offline() handling from sched_cpu_dying() to
> sched_cpu_deactivate(). The original sequence allowed the free_cpus
> bit to be forcibly cleared in the def_root_domain after all of the
> scheduler dust settled. The new location makes sched_set_rq_offline()
> essentially meaningless for the deadline scheduler since the
> management of scheduling-domain changes happens later.
>
> There are likely many different approaches to address this issue
> and I'm hopeful that someone more familiar with the scheduler than
> I am can propose a better solution than the one suggested here.
>
> Thank you for reading this far. Any advice is appreciated.
Thanks a lot for the detailed analysis!
I actually fear that the issue is due to the cpudl_clear_freecpu() call
in rq_offline_dl() being racy, as we don't hold cp->lock while calling
that. So, I think your solution below might be almost correct. I am
thinking we should do something similar in cpudl_set() and remove the
cpudl_{set,clear}_freecpu() calls altogether.
What do you think? If you agree, care to update your patch, please? :)
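Something along these lines, maybe (completely untested, just to
sketch the shape of the idea, with the heap manipulation left as it
is today):

void cpudl_set(struct cpudl *cp, int cpu, u64 dl)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&cp->lock, flags);
	/* heap insertion/update as today */
	cpumask_clear_cpu(cpu, cp->free_cpus);
	raw_spin_unlock_irqrestore(&cp->lock, flags);
}

void cpudl_clear(struct cpudl *cp, int cpu)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&cp->lock, flags);
	/* heap removal as today */
	if (cpu_active(cpu))
		cpumask_set_cpu(cpu, cp->free_cpus);
	raw_spin_unlock_irqrestore(&cp->lock, flags);
}

That way every free_cpus update would be serialized by cp->lock and
gated on cpu_active(), and rq_{online,offline}_dl() wouldn't need to
touch the mask at all.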
Best,
Juri
> kernel/sched/cpudeadline.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 95baa12a1029..6896bbe0e9ae 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
>  		cp->elements[cpu].idx = IDX_INVALID;
>  		cpudl_heapify(cp, old_idx);
>  
> -		cpumask_set_cpu(cpu, cp->free_cpus);
> +		if (cpu_active(cpu))
> +			cpumask_set_cpu(cpu, cp->free_cpus);
>  	}
>  	raw_spin_unlock_irqrestore(&cp->lock, flags);
>  }
> --
> 2.34.1
>
On 1/13/2025 2:54 AM, Juri Lelli wrote:
> Hi Doug,
>
> On 10/01/25 15:30, Doug Berger wrote:
>> There is a hazard in the deadline scheduler where an offlined CPU
>> can have its free_cpus bit left set in the def_root_domain when
>> the schedutil cpufreq governor is used. This can allow a deadline
>> thread to be pushed to the runqueue of a powered down CPU which
>> breaks scheduling.
>>
>> This commit works around the issue by only setting the free_cpus
>> bit for a CPU when it is "active". It is likely that the ordering
>> of sched_set_rq_online() and set_cpu_active() at the end of the
>> sched_cpu_deactivate() function should be revisited if this
>> approach has merit.
>>
>> Signed-off-by: Doug Berger <opendmb@gmail.com>
>> ---
<snip>
>> This problem appears to have been initially introduced by commit
>> 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
>> moved the set_rq_offline() handling from sched_cpu_dying() to
>> sched_cpu_deactivate(). The original sequence allowed the free_cpus
>> bit to be forcibly cleared in the def_root_domain after all of the
>> scheduler dust settled. The new location makes sched_set_rq_offline()
>> essentially meaningless for the deadline scheduler since the
>> management of scheduling-domain changes happens later.
>>
>> There are likely many different approaches to address this issue
>> and I'm hopeful that someone more familiar with the scheduler than
>> I am can propose a better solution than the one suggested here.
>>
>> Thank you for reading this far. Any advice is appreciated.
>
> Thanks a lot for the detailed analysis!
Thanks for the review!
>
> I actually fear that the issue is due to the cpudl_clear_freecpu() call
> in rq_offline_dl() being racy, as we don't hold cp->lock while calling
> that. So, I think your solution below might be almost correct. I am
> thinking we should do something similar in cpudl_set() and remove the
> cpudl_{set,clear}_freecpu() calls altogether.
>
> What do you think? If you agree, care to update your patch, please? :)
As I noted, I'm not really keen on what I proposed here. In particular
I am wary of introducing any extra processing into normal deadline
scheduler execution paths.
I am about to submit an alternative that only affects paths when the
topology is changing that I hope you will take another look at ;).
I am still very much open to someone saying something like "You know we
could probably bring the def_root_domain of the onlining CPU online
earlier so that set_rq_offline() gets called for it when attaching to
the new scheduling domain." I just don't feel comfortable enough yet
with my understanding of the scheduler to know what all of the
implications of such a change would be.
>
> Best,
> Juri
Thanks again!
Doug