[PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs

Pierre Gondois posted 7 patches 1 year, 10 months ago
include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
include/linux/sched/nohz.h      |  2 ++
kernel/sched/fair.c             | 44 +++++++++++++++++-------------
kernel/sched/isolation.c        | 48 ++++++++++++++++++++++++++++++++-
kernel/sched/topology.c         |  7 +++++
5 files changed, 110 insertions(+), 21 deletions(-)
[PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs
Posted by Pierre Gondois 1 year, 10 months ago
Zhang Rui reported that find_new_ilb() was iterating over CPUs in
isolated cgroup partitions. This triggered spurious wakeups for
theses CPUs. [1]
The initial approach was to ignore CPUs on NULL sched domains, as
isolated CPUs have a NULL sched domain. However a CPU:
- with its tick disabled, so taken into account in
  nohz.[idle_cpus_mask|nr_cpus]
- which is placed in an isolated cgroup partition
will never update nohz.[idle_cpus_mask|nr_cpus] again.

To avoid that, the following variables should be cleared
when a CPU is placed in an isolated cgroup partition:
- nohz.idle_cpus_mask
- nohz.nr_cpus
- rq->nohz_tick_stopped
This would allow to avoid considering wrong nohz.* values during
idle load balance.

As suggested in [2] and to avoid calling nohz_balance_[enter|exit]_idle()
from a remote CPU and create concurrency issues, leverage the existing
housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
sched domains).
Indeed the HK_TYPE_SCHED mask is currently never set by the
isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
Plus it's current usage fits CPUs that are isolated and should
not take part in load balancing.

Making use of HK_TYPE_SCHED for this purpose implies creating a
housekeeping mask which can be modified at runtime.

[1] https://lore.kernel.org/all/20230804090858.7605-1-rui.zhang@intel.com/
[2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/

Pierre Gondois (7):
  sched/isolation: Introduce housekeeping_runtime isolation
  sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
  sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
  sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
  sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
  sched/fair: Remove on_null_domain() and redundant checks
  sched/fair: Clear idle_cpus_mask for CPUs with NULL sd

 include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
 include/linux/sched/nohz.h      |  2 ++
 kernel/sched/fair.c             | 44 +++++++++++++++++-------------
 kernel/sched/isolation.c        | 48 ++++++++++++++++++++++++++++++++-
 kernel/sched/topology.c         |  7 +++++
 5 files changed, 110 insertions(+), 21 deletions(-)

-- 
2.25.1
Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs
Posted by Waiman Long 1 year, 10 months ago
On 4/3/24 11:05, Pierre Gondois wrote:
> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
> isolated cgroup partitions. This triggered spurious wakeups for
> theses CPUs. [1]
> The initial approach was to ignore CPUs on NULL sched domains, as
> isolated CPUs have a NULL sched domain. However a CPU:
> - with its tick disabled, so taken into account in
>    nohz.[idle_cpus_mask|nr_cpus]
> - which is placed in an isolated cgroup partition
> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>
> To avoid that, the following variables should be cleared
> when a CPU is placed in an isolated cgroup partition:
> - nohz.idle_cpus_mask
> - nohz.nr_cpus
> - rq->nohz_tick_stopped
> This would allow to avoid considering wrong nohz.* values during
> idle load balance.
>
> As suggested in [2] and to avoid calling nohz_balance_[enter|exit]_idle()
> from a remote CPU and create concurrency issues, leverage the existing
> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
> sched domains).
> Indeed the HK_TYPE_SCHED mask is currently never set by the
> isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
> Plus it's current usage fits CPUs that are isolated and should
> not take part in load balancing.
>
> Making use of HK_TYPE_SCHED for this purpose implies creating a
> housekeeping mask which can be modified at runtime.
>
> [1] https://lore.kernel.org/all/20230804090858.7605-1-rui.zhang@intel.com/
> [2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>
> Pierre Gondois (7):
>    sched/isolation: Introduce housekeeping_runtime isolation
>    sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>    sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>    sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>    sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>    sched/fair: Remove on_null_domain() and redundant checks
>    sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>
>   include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>   include/linux/sched/nohz.h      |  2 ++
>   kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>   kernel/sched/isolation.c        | 48 ++++++++++++++++++++++++++++++++-
>   kernel/sched/topology.c         |  7 +++++
>   5 files changed, 110 insertions(+), 21 deletions(-)
>
I had also posted a patch series on excluding isolated CPUs in isolated 
partitions from housekeeping cpumasks earlier this year. See

https://lore.kernel.org/lkml/20240229021414.508972-1-longman@redhat.com/

It took a different approach from this series. It looks like I should 
include HK_TYPE_MISC as well.

Cheers,
Longman
Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs
Posted by Pierre Gondois 1 year, 10 months ago
Hello Waiman,
Thanks for the link, I didn't see the patchset previously.

On 4/4/24 05:01, Waiman Long wrote:
> On 4/3/24 11:05, Pierre Gondois wrote:
>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>> isolated cgroup partitions. This triggered spurious wakeups for
>> theses CPUs. [1]
>> The initial approach was to ignore CPUs on NULL sched domains, as
>> isolated CPUs have a NULL sched domain. However a CPU:
>> - with its tick disabled, so taken into account in
>>     nohz.[idle_cpus_mask|nr_cpus]
>> - which is placed in an isolated cgroup partition
>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>
>> To avoid that, the following variables should be cleared
>> when a CPU is placed in an isolated cgroup partition:
>> - nohz.idle_cpus_mask
>> - nohz.nr_cpus
>> - rq->nohz_tick_stopped
>> This would allow to avoid considering wrong nohz.* values during
>> idle load balance.
>>
>> As suggested in [2] and to avoid calling nohz_balance_[enter|exit]_idle()
>> from a remote CPU and create concurrency issues, leverage the existing
>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>> sched domains).
>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>> isolcpus/nohz_full kernel parameters, so it defaults to cpu_online_mask.
>> Plus it's current usage fits CPUs that are isolated and should
>> not take part in load balancing.
>>
>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>> housekeeping mask which can be modified at runtime.
>>
>> [1] https://lore.kernel.org/all/20230804090858.7605-1-rui.zhang@intel.com/
>> [2] https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>
>> Pierre Gondois (7):
>>     sched/isolation: Introduce housekeeping_runtime isolation
>>     sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>>     sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>>     sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>>     sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>>     sched/fair: Remove on_null_domain() and redundant checks
>>     sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>
>>    include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>>    include/linux/sched/nohz.h      |  2 ++
>>    kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>>    kernel/sched/isolation.c        | 48 ++++++++++++++++++++++++++++++++-
>>    kernel/sched/topology.c         |  7 +++++
>>    5 files changed, 110 insertions(+), 21 deletions(-)
>>
> I had also posted a patch series on excluding isolated CPUs in isolated
> partitions from housekeeping cpumasks earlier this year. See
> 
> https://lore.kernel.org/lkml/20240229021414.508972-1-longman@redhat.com/
> 
> It took a different approach from this series. It looks like I should
> include HK_TYPE_MISC as well.

The common point between the 2 patchset is that find_new_ilb() won't
take into account isolated CPUs.
The present patchset also:
- clears nohz.[idle_cpus_mask|nr_cpus] variable when a CPU becomes isolated,
   cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
- tries to clean up/gather on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
   mask checks, as HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
   set.
but it also:
- updates the housekeeping mask from sched/topology.c. It might be better
   to do it from cpuset.c as you did as the update originally comes from
   here and it is unlikely another place would require updating housekeeping
   CPUs.
   A new housekeeping_runtime type is also created, but I think the way you
   handle updating housekeeping mask at runtime is better.
- adds a dependency of sched/fair.c over CPU_ISOLATION (cf. housekeeping_*
   calls), as Peter noted (IIUC) [1].

Should I re-spin the patchset and try to correct those points ? Or do you
think this should be done differently ?

Regards,
Pierre

[1] https://lore.kernel.org/lkml/20240404072745.GA35684@noisy.programming.kicks-ass.net/

> 
> Cheers,
> Longman
>
Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs
Posted by Waiman Long 1 year, 10 months ago
On 4/4/24 08:55, Pierre Gondois wrote:
> Hello Waiman,
> Thanks for the link, I didn't see the patchset previously.
>
> On 4/4/24 05:01, Waiman Long wrote:
>> On 4/3/24 11:05, Pierre Gondois wrote:
>>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>>> isolated cgroup partitions. This triggered spurious wakeups for
>>> theses CPUs. [1]
>>> The initial approach was to ignore CPUs on NULL sched domains, as
>>> isolated CPUs have a NULL sched domain. However a CPU:
>>> - with its tick disabled, so taken into account in
>>>     nohz.[idle_cpus_mask|nr_cpus]
>>> - which is placed in an isolated cgroup partition
>>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>>
>>> To avoid that, the following variables should be cleared
>>> when a CPU is placed in an isolated cgroup partition:
>>> - nohz.idle_cpus_mask
>>> - nohz.nr_cpus
>>> - rq->nohz_tick_stopped
>>> This would allow to avoid considering wrong nohz.* values during
>>> idle load balance.
>>>
>>> As suggested in [2] and to avoid calling 
>>> nohz_balance_[enter|exit]_idle()
>>> from a remote CPU and create concurrency issues, leverage the existing
>>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>>> sched domains).
>>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>>> isolcpus/nohz_full kernel parameters, so it defaults to 
>>> cpu_online_mask.
>>> Plus it's current usage fits CPUs that are isolated and should
>>> not take part in load balancing.
>>>
>>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>>> housekeeping mask which can be modified at runtime.
>>>
>>> [1] 
>>> https://lore.kernel.org/all/20230804090858.7605-1-rui.zhang@intel.com/
>>> [2] 
>>> https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>>
>>> Pierre Gondois (7):
>>>     sched/isolation: Introduce housekeeping_runtime isolation
>>>     sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>>>     sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>>>     sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>>>     sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>>>     sched/fair: Remove on_null_domain() and redundant checks
>>>     sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>>
>>>    include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>>>    include/linux/sched/nohz.h      |  2 ++
>>>    kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>>>    kernel/sched/isolation.c        | 48 
>>> ++++++++++++++++++++++++++++++++-
>>>    kernel/sched/topology.c         |  7 +++++
>>>    5 files changed, 110 insertions(+), 21 deletions(-)
>>>
>> I had also posted a patch series on excluding isolated CPUs in isolated
>> partitions from housekeeping cpumasks earlier this year. See
>>
>> https://lore.kernel.org/lkml/20240229021414.508972-1-longman@redhat.com/
>>
>> It took a different approach from this series. It looks like I should
>> include HK_TYPE_MISC as well.
>
> The common point between the 2 patchset is that find_new_ilb() won't
> take into account isolated CPUs.
> The present patchset also:
> - clears nohz.[idle_cpus_mask|nr_cpus] variable when a CPU becomes 
> isolated,
>   cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
> - tries to clean up/gather on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
>   mask checks, as HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
>   set.
> but it also:
> - updates the housekeeping mask from sched/topology.c. It might be better
>   to do it from cpuset.c as you did as the update originally comes from
>   here and it is unlikely another place would require updating 
> housekeeping
>   CPUs.
>   A new housekeeping_runtime type is also created, but I think the way 
> you
>   handle updating housekeeping mask at runtime is better.
> - adds a dependency of sched/fair.c over CPU_ISOLATION (cf. 
> housekeeping_*
>   calls), as Peter noted (IIUC) [1].

That is true. Without CONFIG_CPU_ISOLATION, all the housekeeping* calls 
are essentially no-ops.

OTOH, without CONFIG_CPU_ISOLATION, a number of isolation capabilities 
won't be there. Most distros will have this config option set anyway.

BTW, a number of the HK_TYPE_* are also used at runtime, like 
HK_TYPE_TIMER and HK_TYPE_RCU. So it is hard to artificially distinguish 
between runtime or boot time.

I don't believe you need to add direct dependency on 
CONFIG_CPU_ISOLATION, but you do have to add any housekeeping check as 
an additional check, not as a replacement of the existing check.

Cheers,
Longman

Re: [PATCH 0/7] sched/fair|isolation: Correctly clear nohz.[nr_cpus|idle_cpus_mask] for isolated CPUs
Posted by Pierre Gondois 1 year, 10 months ago

On 4/5/24 02:23, Waiman Long wrote:
> On 4/4/24 08:55, Pierre Gondois wrote:
>> Hello Waiman,
>> Thanks for the link, I didn't see the patchset previously.
>>
>> On 4/4/24 05:01, Waiman Long wrote:
>>> On 4/3/24 11:05, Pierre Gondois wrote:
>>>> Zhang Rui reported that find_new_ilb() was iterating over CPUs in
>>>> isolated cgroup partitions. This triggered spurious wakeups for
>>>> theses CPUs. [1]
>>>> The initial approach was to ignore CPUs on NULL sched domains, as
>>>> isolated CPUs have a NULL sched domain. However a CPU:
>>>> - with its tick disabled, so taken into account in
>>>>      nohz.[idle_cpus_mask|nr_cpus]
>>>> - which is placed in an isolated cgroup partition
>>>> will never update nohz.[idle_cpus_mask|nr_cpus] again.
>>>>
>>>> To avoid that, the following variables should be cleared
>>>> when a CPU is placed in an isolated cgroup partition:
>>>> - nohz.idle_cpus_mask
>>>> - nohz.nr_cpus
>>>> - rq->nohz_tick_stopped
>>>> This would allow to avoid considering wrong nohz.* values during
>>>> idle load balance.
>>>>
>>>> As suggested in [2] and to avoid calling
>>>> nohz_balance_[enter|exit]_idle()
>>>> from a remote CPU and create concurrency issues, leverage the existing
>>>> housekeeping HK_TYPE_SCHED mask to reflect isolated CPUs (i.e. on NULL
>>>> sched domains).
>>>> Indeed the HK_TYPE_SCHED mask is currently never set by the
>>>> isolcpus/nohz_full kernel parameters, so it defaults to
>>>> cpu_online_mask.
>>>> Plus it's current usage fits CPUs that are isolated and should
>>>> not take part in load balancing.
>>>>
>>>> Making use of HK_TYPE_SCHED for this purpose implies creating a
>>>> housekeeping mask which can be modified at runtime.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20230804090858.7605-1-rui.zhang@intel.com/
>>>> [2]
>>>> https://lore.kernel.org/all/CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com/
>>>>
>>>> Pierre Gondois (7):
>>>>      sched/isolation: Introduce housekeeping_runtime isolation
>>>>      sched/isolation: Move HK_TYPE_SCHED to housekeeping runtime
>>>>      sched/isolation: Use HKR_TYPE_SCHED in find_new_ilb()
>>>>      sched/fair: Move/add on_null_domain()/housekeeping_cpu() checks
>>>>      sched/topology: Remove CPUs with NULL sd from HKR_TYPE_SCHED mask
>>>>      sched/fair: Remove on_null_domain() and redundant checks
>>>>      sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>>>>
>>>>     include/linux/sched/isolation.h | 30 ++++++++++++++++++++-
>>>>     include/linux/sched/nohz.h      |  2 ++
>>>>     kernel/sched/fair.c             | 44 +++++++++++++++++-------------
>>>>     kernel/sched/isolation.c        | 48
>>>> ++++++++++++++++++++++++++++++++-
>>>>     kernel/sched/topology.c         |  7 +++++
>>>>     5 files changed, 110 insertions(+), 21 deletions(-)
>>>>
>>> I had also posted a patch series on excluding isolated CPUs in isolated
>>> partitions from housekeeping cpumasks earlier this year. See
>>>
>>> https://lore.kernel.org/lkml/20240229021414.508972-1-longman@redhat.com/
>>>
>>> It took a different approach from this series. It looks like I should
>>> include HK_TYPE_MISC as well.

Yes right, but as noted, if CONFIG_CPUSET is set without CPU_ISOLATION,
adding HK_TYPE_MISC to HOUSEKEEPING_FLAGS in your patchset would have no
effect right ?
The only check that would be always true is on_null_domain() from [1].

>>
>> The common point between the 2 patchset is that find_new_ilb() won't
>> take into account isolated CPUs.
>> The present patchset also:
>> - clears nohz.[idle_cpus_mask|nr_cpus] variable when a CPU becomes
>> isolated,
>>    cf. [PATCH 7/7] sched/fair: Clear idle_cpus_mask for CPUs with NULL sd
>> - tries to clean up/gather on_null_domain()/HK_TYPE_SCHED/HK_TYPE_MISC
>>    mask checks, as HK_TYPE_SCHED/HK_TYPE_MISC masks are currently never
>>    set.
>> but it also:
>> - updates the housekeeping mask from sched/topology.c. It might be better
>>    to do it from cpuset.c as you did as the update originally comes from
>>    here and it is unlikely another place would require updating
>> housekeeping
>>    CPUs.
>>    A new housekeeping_runtime type is also created, but I think the way
>> you
>>    handle updating housekeeping mask at runtime is better.
>> - adds a dependency of sched/fair.c over CPU_ISOLATION (cf.
>> housekeeping_*
>>    calls), as Peter noted (IIUC) [1].
> 
> That is true. Without CONFIG_CPU_ISOLATION, all the housekeeping* calls
> are essentially no-ops.
> 
> OTOH, without CONFIG_CPU_ISOLATION, a number of isolation capabilities
> won't be there. Most distros will have this config option set anyway.
> 
> BTW, a number of the HK_TYPE_* are also used at runtime, like
> HK_TYPE_TIMER and HK_TYPE_RCU. So it is hard to artificially distinguish
> between runtime or boot time.
> 
> I don't believe you need to add direct dependency on
> CONFIG_CPU_ISOLATION, but you do have to add any housekeeping check as
> an additional check, not as a replacement of the existing check.

(on another topic)

Isolated CPUs currently keep the state they were in when the isolation was
done, i.e. if the tick was stopped when adding the CPU was added to the
isolated cpumask, then the CPU stays in nohz.idle_cpus_mask forever.
Similarly if the tick was not stopped, the CPU is cleared forever in
nohz.idle_cpus_mask.

This patchset also intended to clear isolated CPUs in nohz.idle_cpus_mask
to let them in a known state. This might not be a good approach.

nohz.idle_cpus_mask is also used in:
   nohz_run_idle_balance()
   \-_nohz_idle_balance()
     \-for_each_cpu_wrap(balance_cpu,  nohz.idle_cpus_mask, this_cpu+1)
       \-update_nohz_stats()
This is apparently done to update 'the blocked load of already idle CPUs'.
However isolated CPUs might not have their blocked load updated as they are:
- currently randomly part of nohz.idle_cpus_mask.
- after this patch, never part of nohz.idle_cpus_mask.

I am also wondering whether this makes sense for isolated CPUs to update
the blocked load of other CPUs, i.e. if the following would not be needed:



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1dd37168da50..9b92700564b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12291,6 +12291,9 @@ void nohz_run_idle_balance(int cpu)
  {
         unsigned int flags;
  
+       if (on_null_domain(cpu_rq(cpu)))
+               return;
+
         flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
  
         /*


Regards,
Pierre