sched/deadline: only mark active cpu as free

[RFC] sched/deadline: only mark active cpu as free

Posted by Doug Berger 1 year ago

There is a hazard in the deadline scheduler where an offlined CPU
can have its free_cpus bit left set in the def_root_domain when
the schedutil cpufreq governor is used. This can allow a deadline
thread to be pushed to the runqueue of a powered down CPU which
breaks scheduling.

This commit works around the issue by only setting the free_cpus
bit for a CPU when it is "active". It is likely that the ordering
of sched_set_rq_online() and set_cpu_active() at the end of the
sched_cpu_deactivate() function should be revisited if this
approach has merit.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---

Coffee is recommended before proceeding.

While stress testing CPU hotplug on a quad-core arm64 architecture
system I encountered a deadlock. My specific deadlock appears to be
dependent on the system having three or more cores and using the
sched-util cpufreq governor which uses a deadline scheduled thread
named "sugov:n" where n is the CPU number.

The scenario I observe is as follows:
Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
previously offlined so their runqueues are attached to the
def_root_domain.
1) A hot plug is initiated on CPU2.
2) The cpuhp/2 thread invokes the cpufreq governor driver during
   the CPUHP_AP_ONLINE_DYN step.
3) The sched util cpufreq governor creates the "sugov:2" thread to
   execute on CPU2 with the deadline scheduler.
4) The deadline scheduler clears the free_cpus mask for CPU2 within
   the def_root_domain when "sugov:2" is scheduled.
5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
   clear the deadline which sets the free_cpus mask for CPU2 within
   the def_root_domain.
6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
   domain is created to include CPU0, CPU1, and CPU2.
   o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
     CPU1 which offlines their runqueues and detaches their current
     dynamic scheduling domain (clearing their deadline free_cpus
     bits there) and attaches the def_root_domain and onlines their
     runqueus (setting their deadline free_cpus bits there).
   o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
     and CPU2.
     - Since only CPU0 and CPU1 are online in the def_root_domain
       set_rq_offline() is only called for them to offline their
       runqueues and detach the def_root_domain (clearing their
       deadline free_cpus bits there).
     - The free_cpus bit for CPU2 in def_root_domain is allowed to
       remain set.
     - The newly created dynamic scheduling domain is attached to
       CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
       to online their runqueues (setting their deadline free_cpus
       bits there).
7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
   CPUHP_AP_ACTIVE step, but since the runqueues are already online
   essentially nothing happens.
8) Some time later CPU2 is hot unplugged.
9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
   invokes balance_push_set() for CPU2 which migrates "sugov:2" to
   a different CPU through fallback.
10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
    CPU2 which takes its runqueue offline and clears its deadline
    free_cpus bit in the current dynamic scheduling domain.
11) Also at this step, cpuhp/2 updates the scheduling domain to
    remove CPU2.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0,
      CPU1, and CPU2 to move them back to the def_root_domain.
      - Since only CPU0 and CPU1 are online in the current dynamic
        scheduling domain (CPU2 was removed at 10 above),
        set_rq_offline() is only called for them to clear their
        deadline free_cpus bits.
      - The def_root_domain is attached to CPU0, CPU1, and CPU2
        runqueues and since only CPU0 and CPU1 are marked active
        set_rq_online() is used to online their runqueues (setting
        their deadline free_cpus bits there).
      - The free_cpus bit for CPU2 in def_root_domain is allowed
        to remain set.
    o build_sched_domains() invokes rq_attach_root() for CPU0 and
      CPU1 which offlines their runqueues (clearing their deadline
      free_cpus bits in def_root_domain), attaches a new dynamic
      scheduling domain, and onlines their runqueus (setting their
      deadline free_cpus bits there).
12) The cpuhp/2 thread invokes the cpufreq governor driver during
    the CPUHP_AP_ONLINE_DYN step which attempts to stop the
    "sugov:2" kthread by calling kthread_flush_worker() followed by
    kthread_stop().
13) The "sugov:2" thread can likely be successfully deadline
    scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
    complete offlining CPU2 and power it off.
14) Some time later CPU1 is hot unplugged.
15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
   invokes balance_push_set() for CPU1 which migrates "sugov:1" to
   CPU0 through fallback.
16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
    CPU1 which takes its runqueue offline and clears its deadline
    free_cpus bit in the current dynamic scheduling domain.
17) Also at this step, cpuhp/1 updates the scheduling domain to
    remove CPU1.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0
      and CPU1 to move them back to the def_root_domain.
      - Since only CPU0 is online in the current dynamic scheduling
        domain (CPU1 was removed at 16 above), set_rq_offline() is
        only called for it to clear its deadline free_cpus bit.
      - The def_root_domain is attached to CPU0 and CPU1 runqueues
        and since only CPU0 is marked active set_rq_online() is
        used to online its runqueue (setting its deadline free_cpus
        bits there).
      - The free_cpus bit for CPU1 is untouched in def_root_domain.
      - The free_cpus bit for CPU2 in def_root_domain remains set
        from the preceeding sequence.
18) If CPU0 executes the "sugov:0" deadline thread at this time it
    may see that the "sugov:1" deadline thread is also on its
    runqueue and may call push_dl_task() to attempt to push it to a
    different CPU.
19) The effort to find a later runqueue will find the stale
    free_cpus bit of CPU2 in the currently attached def_root_domain
    and will migrate the "sugov:1" thread to the runqueue of the
    powered down CPU2 where it can never get scheduled.
20) The cpuhp/1 thread invokes the cpufreq governor driver during
    the CPUHP_AP_ONLINE_DYN step which attempts to stop the
    "sugov:1" kthread by calling kthread_flush_worker() followed by
    kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
    remains blocked on completion events.

Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
the deadline scheduler def_root_domain to remain set for a CPU that
is powered off. The trap can be sprung during the narrow timing
hazard when the def_root_domain is transitionally attached while
changing scheduling domains if the deadline scheduler pushes a
queued task to the powered off CPU.

This problem appears to have been initially introduced by commit 
120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
moved the set_rq_offline() handling from sched_cpu_dying() to
sched_cpu_deactivate(). The original sequence allowed the free_cpus
bit to be forcibly cleared in the def_root_domains after all of the
scheduler dust settled. The new location makes the
sched_set_rq_offline() essentially meaningless for the deadline
scheduler since the managing of changed scheduling domains happens
later.

There are likely many different approaches to address this issue
and I'm hopeful that somone more familiar with the scheduler than
I can propose a better solution than the one suggested here.

Thank you for reading this far. Any advice is appreciated.
-Doug

 kernel/sched/cpudeadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6896bbe0e9ae 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
 		cp->elements[cpu].idx = IDX_INVALID;
 		cpudl_heapify(cp, old_idx);
 
-		cpumask_set_cpu(cpu, cp->free_cpus);
+		if (cpu_active(cpu))
+			cpumask_set_cpu(cpu, cp->free_cpus);
 	}
 	raw_spin_unlock_irqrestore(&cp->lock, flags);
 }
-- 
2.34.1

Re: [RFC] sched/deadline: only mark active cpu as free

Posted by Juri Lelli 1 year ago

Hi Doug,

On 10/01/25 15:30, Doug Berger wrote:
> There is a hazard in the deadline scheduler where an offlined CPU
> can have its free_cpus bit left set in the def_root_domain when
> the schedutil cpufreq governor is used. This can allow a deadline
> thread to be pushed to the runqueue of a powered down CPU which
> breaks scheduling.
> 
> This commit works around the issue by only setting the free_cpus
> bit for a CPU when it is "active". It is likely that the ordering
> of sched_set_rq_online() and set_cpu_active() at the end of the
> sched_cpu_deactivate() function should be revisited if this
> approach has merit.
> 
> Signed-off-by: Doug Berger <opendmb@gmail.com>
> ---
> 
> Coffee is recommended before proceeding.
> 
> While stress testing CPU hotplug on a quad-core arm64 architecture
> system I encountered a deadlock. My specific deadlock appears to be
> dependent on the system having three or more cores and using the
> sched-util cpufreq governor which uses a deadline scheduled thread
> named "sugov:n" where n is the CPU number.
> 
> The scenario I observe is as follows:
> Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
> previously offlined so their runqueues are attached to the
> def_root_domain.
> 1) A hot plug is initiated on CPU2.
> 2) The cpuhp/2 thread invokes the cpufreq governor driver during
>    the CPUHP_AP_ONLINE_DYN step.
> 3) The sched util cpufreq governor creates the "sugov:2" thread to
>    execute on CPU2 with the deadline scheduler.
> 4) The deadline scheduler clears the free_cpus mask for CPU2 within
>    the def_root_domain when "sugov:2" is scheduled.
> 5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
>    clear the deadline which sets the free_cpus mask for CPU2 within
>    the def_root_domain.
> 6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
>    domain is created to include CPU0, CPU1, and CPU2.
>    o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
>      CPU1 which offlines their runqueues and detaches their current
>      dynamic scheduling domain (clearing their deadline free_cpus
>      bits there) and attaches the def_root_domain and onlines their
>      runqueus (setting their deadline free_cpus bits there).
>    o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
>      and CPU2.
>      - Since only CPU0 and CPU1 are online in the def_root_domain
>        set_rq_offline() is only called for them to offline their
>        runqueues and detach the def_root_domain (clearing their
>        deadline free_cpus bits there).
>      - The free_cpus bit for CPU2 in def_root_domain is allowed to
>        remain set.
>      - The newly created dynamic scheduling domain is attached to
>        CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
>        to online their runqueues (setting their deadline free_cpus
>        bits there).
> 7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
>    CPUHP_AP_ACTIVE step, but since the runqueues are already online
>    essentially nothing happens.
> 8) Some time later CPU2 is hot unplugged.
> 9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
>    invokes balance_push_set() for CPU2 which migrates "sugov:2" to
>    a different CPU through fallback.
> 10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
>     CPU2 which takes its runqueue offline and clears its deadline
>     free_cpus bit in the current dynamic scheduling domain.
> 11) Also at this step, cpuhp/2 updates the scheduling domain to
>     remove CPU2.
>     o detach_destroy_domains() invokes rq_attach_root() for CPU0,
>       CPU1, and CPU2 to move them back to the def_root_domain.
>       - Since only CPU0 and CPU1 are online in the current dynamic
>         scheduling domain (CPU2 was removed at 10 above),
>         set_rq_offline() is only called for them to clear their
>         deadline free_cpus bits.
>       - The def_root_domain is attached to CPU0, CPU1, and CPU2
>         runqueues and since only CPU0 and CPU1 are marked active
>         set_rq_online() is used to online their runqueues (setting
>         their deadline free_cpus bits there).
>       - The free_cpus bit for CPU2 in def_root_domain is allowed
>         to remain set.
>     o build_sched_domains() invokes rq_attach_root() for CPU0 and
>       CPU1 which offlines their runqueues (clearing their deadline
>       free_cpus bits in def_root_domain), attaches a new dynamic
>       scheduling domain, and onlines their runqueus (setting their
>       deadline free_cpus bits there).
> 12) The cpuhp/2 thread invokes the cpufreq governor driver during
>     the CPUHP_AP_ONLINE_DYN step which attempts to stop the
>     "sugov:2" kthread by calling kthread_flush_worker() followed by
>     kthread_stop().
> 13) The "sugov:2" thread can likely be successfully deadline
>     scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
>     complete offlining CPU2 and power it off.
> 14) Some time later CPU1 is hot unplugged.
> 15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
>    invokes balance_push_set() for CPU1 which migrates "sugov:1" to
>    CPU0 through fallback.
> 16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
>     CPU1 which takes its runqueue offline and clears its deadline
>     free_cpus bit in the current dynamic scheduling domain.
> 17) Also at this step, cpuhp/1 updates the scheduling domain to
>     remove CPU1.
>     o detach_destroy_domains() invokes rq_attach_root() for CPU0
>       and CPU1 to move them back to the def_root_domain.
>       - Since only CPU0 is online in the current dynamic scheduling
>         domain (CPU1 was removed at 16 above), set_rq_offline() is
>         only called for it to clear its deadline free_cpus bit.
>       - The def_root_domain is attached to CPU0 and CPU1 runqueues
>         and since only CPU0 is marked active set_rq_online() is
>         used to online its runqueue (setting its deadline free_cpus
>         bits there).
>       - The free_cpus bit for CPU1 is untouched in def_root_domain.
>       - The free_cpus bit for CPU2 in def_root_domain remains set
>         from the preceeding sequence.
> 18) If CPU0 executes the "sugov:0" deadline thread at this time it
>     may see that the "sugov:1" deadline thread is also on its
>     runqueue and may call push_dl_task() to attempt to push it to a
>     different CPU.
> 19) The effort to find a later runqueue will find the stale
>     free_cpus bit of CPU2 in the currently attached def_root_domain
>     and will migrate the "sugov:1" thread to the runqueue of the
>     powered down CPU2 where it can never get scheduled.
> 20) The cpuhp/1 thread invokes the cpufreq governor driver during
>     the CPUHP_AP_ONLINE_DYN step which attempts to stop the
>     "sugov:1" kthread by calling kthread_flush_worker() followed by
>     kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
>     remains blocked on completion events.
> 
> Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
> the deadline scheduler def_root_domain to remain set for a CPU that
> is powered off. The trap can be sprung during the narrow timing
> hazard when the def_root_domain is transitionally attached while
> changing scheduling domains if the deadline scheduler pushes a
> queued task to the powered off CPU.
> 
> This problem appears to have been initially introduced by commit 
> 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
> moved the set_rq_offline() handling from sched_cpu_dying() to
> sched_cpu_deactivate(). The original sequence allowed the free_cpus
> bit to be forcibly cleared in the def_root_domains after all of the
> scheduler dust settled. The new location makes the
> sched_set_rq_offline() essentially meaningless for the deadline
> scheduler since the managing of changed scheduling domains happens
> later.
> 
> There are likely many different approaches to address this issue
> and I'm hopeful that somone more familiar with the scheduler than
> I can propose a better solution than the one suggested here.
> 
> Thank you for reading this far. Any advice is appreciated.

Thanks a lot for the detailed analysis!

I actually fear that the issue is due to the cpudl_clear_freecpu() call
in rq_offline_dl() being racy, as we don't hold cp->lock while calling
that. So, I think your solution below might be almost correct. I am
thinking we should do something similar in cpudl_set() and remove cpudl_
{set,clear}_freecpu() calls altogether.

What do you think? If agree, care to update your patch please? :)

Best,
Juri

>  kernel/sched/cpudeadline.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 95baa12a1029..6896bbe0e9ae 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
>  		cp->elements[cpu].idx = IDX_INVALID;
>  		cpudl_heapify(cp, old_idx);
>  
> -		cpumask_set_cpu(cpu, cp->free_cpus);
> +		if (cpu_active(cpu))
> +			cpumask_set_cpu(cpu, cp->free_cpus);
>  	}
>  	raw_spin_unlock_irqrestore(&cp->lock, flags);
>  }
> -- 
> 2.34.1
>

Re: [RFC] sched/deadline: only mark active cpu as free

Posted by Doug Berger 1 year ago

On 1/13/2025 2:54 AM, Juri Lelli wrote:
> Hi Doug,
> 
> On 10/01/25 15:30, Doug Berger wrote:
>> There is a hazard in the deadline scheduler where an offlined CPU
>> can have its free_cpus bit left set in the def_root_domain when
>> the schedutil cpufreq governor is used. This can allow a deadline
>> thread to be pushed to the runqueue of a powered down CPU which
>> breaks scheduling.
>>
>> This commit works around the issue by only setting the free_cpus
>> bit for a CPU when it is "active". It is likely that the ordering
>> of sched_set_rq_online() and set_cpu_active() at the end of the
>> sched_cpu_deactivate() function should be revisited if this
>> approach has merit.
>>
>> Signed-off-by: Doug Berger <opendmb@gmail.com>
>> ---
<snip>
>> This problem appears to have been initially introduced by commit
>> 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
>> moved the set_rq_offline() handling from sched_cpu_dying() to
>> sched_cpu_deactivate(). The original sequence allowed the free_cpus
>> bit to be forcibly cleared in the def_root_domains after all of the
>> scheduler dust settled. The new location makes the
>> sched_set_rq_offline() essentially meaningless for the deadline
>> scheduler since the managing of changed scheduling domains happens
>> later.
>>
>> There are likely many different approaches to address this issue
>> and I'm hopeful that somone more familiar with the scheduler than
>> I can propose a better solution than the one suggested here.
>>
>> Thank you for reading this far. Any advice is appreciated.
> 
> Thanks a lot for the detailed analysis!
Thanks for the review!

> 
> I actually fear that the issue is due to the cpudl_clear_freecpu() call
> in rq_offline_dl() being racy, as we don't hold cp->lock while calling
> that. So, I think your solution below might be almost correct. I am
> thinking we should do something similar in cpudl_set() and remove cpudl_
> {set,clear}_freecpu() calls altogether.
> 
> What do you think? If agree, care to update your patch please? :)
As I noted, I'm not real keen on what I proposed here. In particular I 
am wary of introducing any extra processing to normal deadline scheduler 
execution paths.

I am about to submit an alternative that only affects paths when the 
topology is changing that I hope you will take another look at ;).

I am still very much open to someone saying something like "You know we 
could probably bring the def_root_domain of the onlining CPU online 
earlier so that set_rq_offline() gets called for it when attaching to 
the new scheduling domain." I just don't feel comfortable enough yet 
with my understanding of the scheduler to know what all of the 
implications of such a change would be.

> 
> Best,
> Juri

Thanks again!
     Doug
> 
>>   kernel/sched/cpudeadline.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
>> index 95baa12a1029..6896bbe0e9ae 100644
>> --- a/kernel/sched/cpudeadline.c
>> +++ b/kernel/sched/cpudeadline.c
>> @@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
>>   		cp->elements[cpu].idx = IDX_INVALID;
>>   		cpudl_heapify(cp, old_idx);
>>   
>> -		cpumask_set_cpu(cpu, cp->free_cpus);
>> +		if (cpu_active(cpu))
>> +			cpumask_set_cpu(cpu, cp->free_cpus);
>>   	}
>>   	raw_spin_unlock_irqrestore(&cp->lock, flags);
>>   }
>> -- 
>> 2.34.1
>>
>