Changes to v1
parked vs idle
- parked CPUs are now never considered to be idle
- a scheduler group is now considered parked iff it contains parked CPUs
  and has no idle CPUs, i.e. all non-parked CPUs are busy or there are
  only parked CPUs. A scheduler group with tasks on parked CPUs is not
  considered parked if it has idle CPUs which can pick up those tasks.
- idle_cpu_without now always reports a parked CPU as not idle (see the
  sketch below)
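For illustration, a minimal sketch of the shape the idle_cpu_without() check
could take (not necessarily the exact hunk of the series):

        static int idle_cpu_without(int cpu, struct task_struct *p)
        {
                /*
                 * A parked CPU must never report itself as idle,
                 * otherwise wakeups and load balancing would still
                 * treat it as a valid target.
                 */
                if (arch_cpu_parked(cpu))
                        return 0;

                /* ... existing idle checks remain unchanged ... */
        }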
active balance, no_hz, queuing
- should_we_balance always returns true if a scheduler group contains
  a parked CPU and that CPU has a running task
- stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
  if a task is running (sketched below)
- tasks are now prevented from being queued on parked CPUs in ttwu_queue_cond
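For illustration, a minimal sketch of the tick-side check, assuming it sits
at the top of sched_can_stop_tick() (the exact placement in the series may
differ):

        bool sched_can_stop_tick(struct rq *rq)
        {
                /*
                 * Keep the tick running on a parked CPU that still has
                 * tasks queued, so that periodic load balancing keeps
                 * firing and can migrate those tasks away.
                 */
                if (arch_cpu_parked(cpu_of(rq)) && rq->nr_running)
                        return false;

                /* ... existing NO_HZ_FULL checks remain unchanged ... */
        }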
cleanup
- removed duplicate checks for parked CPUs
CPU capacity
- added a patch which removes parked CPUs and their capacity from
  scheduler statistics (illustrated below)
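For illustration, one plausible shape of that statistics change, assuming a
loop over the group span (hypothetical, not the actual hunk):

        unsigned long capacity = 0;
        unsigned int weight = 0;
        int cpu;

        /*
         * A parked CPU contributes neither weight nor capacity, so the
         * remaining CPUs compute group imbalances as if it were not
         * there.
         */
        for_each_cpu(cpu, sched_group_span(group)) {
                if (arch_cpu_parked(cpu))
                        continue;
                capacity += cpu_rq(cpu)->cpu_capacity;
                weight++;
        }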
Original description:
Adding a new scheduler group type which allows removing all tasks
from certain CPUs through load balancing can help in scenarios where
such CPUs are currently unfavorable to use, for example in a
virtualized environment.
Functionally, this works as intended. The open question is whether
this could be considered for inclusion and is worth pursuing further.
If so, which areas would need additional attention?
Some cases are referenced below.
The underlying concept and the approach of adding a new scheduler
group type were presented at the Sched MC of the 2024 LPC.
A short summary:
Some architectures (e.g. s390) provide virtualization at the firmware
level. This implies that Linux kernels running on such architectures
run on virtualized CPUs.
As in other virtualized environments, the CPUs are most likely shared
with other guests at the hardware level. This implies that Linux
kernels running in such an environment may encounter 'steal time'. In
other words, instead of being able to use all available time on a
physical CPU, some of said available time is 'stolen' by other guests.
This can cause side effects if a guest is interrupted at an unfavorable
point in time or if the guest is waiting for one of its other virtual
CPUs to perform certain actions while those are suspended in favour of
another guest.
Architectures, like arch/s390, address this issue by providing an
alternative classification for the CPUs seen by the Linux kernel.
The following example is arch/s390 specific:
In the default mode (horizontal CPU polarization), all CPUs are treated
equally and are equally subject to steal time.
In the alternate mode (vertical CPU polarization), the underlying
firmware hypervisor assigns different types to the CPUs visible to the
guest, depending on how many CPUs the guest is entitled to use. This
entitlement is configured by assigning weights to all active guests.
The three CPU types are:
- vertical high   : On these CPUs, the guest always has the highest
                    priority over other guests. In particular, if the
                    guest executes tasks on these CPUs, it will
                    encounter no steal time.
- vertical medium : These CPUs are meant to cover fractions of
                    entitlement.
- vertical low    : These CPUs have no priority when being scheduled.
                    In particular, while all other guests are using
                    their full entitlement, these CPUs might not run
                    for a significant amount of time.
As a consequence, using vertical lows while the underlying hypervisor
experiences a high load, driven by all defined guests, is to be avoided.
In order to consistently move tasks off of vertical lows, introduce a
new type of scheduler group: group_parked.
Parked implies that tasks should be evacuated from these CPUs as fast
as possible: other CPUs should start pulling tasks immediately, while
the parked CPUs themselves refuse to pull any tasks.
Adding a group type beyond group_overloaded achieves the expected
behavior. By making its selection architecture dependent, it has
no effect on architectures which will not make use of that group type.
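Group types are ordered by severity and compared numerically when picking the
busiest group, so placing the new type past group_overloaded makes parked
groups the preferred source to pull from. A sketch against the group_type
enum in kernel/sched/fair.c as found in recent kernels:

        enum group_type {
                /* The group has spare capacity to run more tasks. */
                group_has_spare = 0,
                group_fully_busy,
                group_misfit_task,
                group_smt_balance,
                group_asym_packing,
                group_imbalanced,
                group_overloaded,
                /* New: all tasks should be evacuated as fast as possible. */
                group_parked,
        };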
This approach works very well for many kinds of workloads. Tasks are
migrated back and forth in line with changes to the parked state of
the involved CPUs.
There are a couple of issues and corner cases which need further
consideration:
- rt & dl: Realtime and deadline scheduling require some additional
attention.
- ext: Probably affected as well. Needs some conceptual thought
first.
- raciness: Right now, there are no synchronization efforts. It needs
to be considered whether those might be necessary or if
it is alright that the parked-state of a CPU might change
during load-balancing.
Patches apply to tip:sched/core
The s390 patch serves as a simplified implementation example.
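For reference, the s390 hook can be as small as the following sketch, assuming
the existing polarization helpers (the condition in the actual patch may
differ, e.g. it may additionally check whether vertical polarization is
active):

        /* arch/s390: report vertical-low CPUs as parked */
        bool arch_cpu_parked(int cpu)
        {
                return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
        }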
Tobias Huschle (3):
sched/fair: introduce new scheduler group type group_parked
sched/fair: adapt scheduler group weight and capacity for parked CPUs
s390/topology: Add initial implementation for selection of parked CPUs
arch/s390/include/asm/smp.h | 2 +
arch/s390/kernel/smp.c | 5 ++
include/linux/sched/topology.h | 19 ++++++
kernel/sched/core.c | 13 ++++-
kernel/sched/fair.c | 104 ++++++++++++++++++++++++++++-----
kernel/sched/syscalls.c | 3 +
6 files changed, 130 insertions(+), 16 deletions(-)
--
2.34.1
On 2/17/25 17:02, Tobias Huschle wrote:
[...]
> There are a couple of issues and corner cases which need further
> consideration:
> - rt & dl: Realtime and deadline scheduling require some additional
>            attention.

I think we need to address at least rt, there would be some non-percpu
kworker threads which need to move out of parked cpus.

> - ext:      Probably affected as well. Needs some conceptual thought
>             first.
> - raciness: Right now, there are no synchronization efforts. It needs
>             to be considered whether those might be necessary or if
>             it is alright that the parked-state of a CPU might change
>             during load-balancing.
>
> Patches apply to tip:sched/core
>
> The s390 patch serves as a simplified implementation example.

Gave it a try on powerpc with the debugfs file. It works for
sched_normal tasks.

[...]
On 18/02/2025 06:58, Shrikanth Hegde wrote:
[...]
>> There are a couple of issues and corner cases which need further
>> consideration:
>> - rt & dl: Realtime and deadline scheduling require some additional
>>            attention.
>
> I think we need to address at least rt, there would be some non-percpu
> kworker threads which need to move out of parked cpus.

Yea, sounds reasonable. Would probably make sense to go next for that one.

[...]
> Gave it a try on powerpc with the debugfs file. It works for
> sched_normal tasks.

That's great to hear!

[...]
On 2/20/25 16:25, Tobias Huschle wrote:
>
>
> On 18/02/2025 06:58, Shrikanth Hegde wrote:
> [...]
>>>
>>> There are a couple of issues and corner cases which need further
>>> consideration:
>>> - rt & dl: Realtime and deadline scheduling require some additional
>>> attention.
>>
>> I think we need to address at least rt, there would be some non-percpu
>> kworker threads which need to move out of parked cpus.
>>
>
> Yea, sounds reasonable. Would probably make sense to go next for that one.
Ok. I was experimenting with the rt code. It's all quite new to me.
I was able to get non-bound rt tasks to honor the CPU parked state. However, it works only
if the rt task performs some wakeups (for example, start hackbench with chrt -r 10).
If it is continuously running (for example, stress-ng with chrt -r 10), then it doesn't pack at runtime when
CPUs become parked after it started running. Not sure how many rt tasks behave that way.
It packs when starting afresh when CPUs are already parked, and unpacks when CPUs become unparked, though.
I added some prints in the rt code to understand it. A few observations:
1. balance_rt or pull_rt_task don't get called once stress-ng starts running.
   That means there is no opportunity to pull the tasks or load balance?
   They do get called while the migration thread is running, but that can't be balanced.
   Is there a way to trigger load balancing of rt tasks when the task doesn't give up the CPU?
2. The regular load balance (sched_balance_rq) does get called even when the CPU is only
   running rt tasks. It attempts the load balance (i.e. it passes update_sd_lb_stats etc.),
   but will not do an actual balance because it only works on src_rq->cfs_tasks.
   That may be an opportunity to skip the load balance if the CPU is running only rt tasks,
   i.e. the CPU is not idle and was chosen as the CPU to do the load balancing because it is
   the first CPU in the group, yet it is running only rt tasks. A rough sketch follows below.
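For illustration, a rough sketch of point 2 as an early bail-out, e.g. in
should_we_balance() (hypothetical placement; the cfs_rq counter has been
renamed across kernel versions, so the field name is illustrative):

        /*
         * The CPU is busy, but none of its queued tasks belong to CFS,
         * so a CFS load balance initiated from this CPU cannot move
         * anything anyway.
         */
        if (rq->nr_running && !rq->cfs.h_nr_running)
                return 0;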
Can point 1 be addressed? And does point 2 make sense?
Also, please suggest a better way if there is one, compared to the patch below.
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4da2e60da9a8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
 	unsigned int max_cap;
 	unsigned int cpu_cap;
 
+	if (arch_cpu_parked(cpu))
+		return false;
+
 	/* Only heterogeneous systems can benefit from this check */
 	if (!sched_asym_cpucap_active())
 		return true;
@@ -476,6 +479,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
 #else
 static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
 {
+	if (arch_cpu_parked(cpu))
+		return false;
+
 	return true;
 }
 #endif
@@ -1801,6 +1807,8 @@ static int find_lowest_rq(struct task_struct *task)
 	int this_cpu = smp_processor_id();
 	int cpu = task_cpu(task);
 	int ret;
+	int parked_cpu = -1;
+	int tmp_cpu;
 
 	/* Make sure the mask is initialized first */
 	if (unlikely(!lowest_mask))
@@ -1809,11 +1817,18 @@ static int find_lowest_rq(struct task_struct *task)
 	if (task->nr_cpus_allowed == 1)
 		return -1; /* No other targets possible */
 
+	for_each_cpu(tmp_cpu, cpu_online_mask) {
+		if (arch_cpu_parked(tmp_cpu)) {
+			parked_cpu = tmp_cpu;
+			break;
+		}
+	}
+
 	/*
 	 * If we're on asym system ensure we consider the different capacities
 	 * of the CPUs when searching for the lowest_mask.
 	 */
-	if (sched_asym_cpucap_active()) {
+	if (sched_asym_cpucap_active() || parked_cpu > -1) {
 
 		ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
 					  task, lowest_mask,
@@ -1835,14 +1850,14 @@ static int find_lowest_rq(struct task_struct *task)
 	 * We prioritize the last CPU that the task executed on since
 	 * it is most likely cache-hot in that location.
 	 */
-	if (cpumask_test_cpu(cpu, lowest_mask))
+	if (cpumask_test_cpu(cpu, lowest_mask) && !arch_cpu_parked(cpu))
 		return cpu;
 
 	/*
 	 * Otherwise, we consult the sched_domains span maps to figure
 	 * out which CPU is logically closest to our hot cache data.
 	 */
-	if (!cpumask_test_cpu(this_cpu, lowest_mask))
+	if (!cpumask_test_cpu(this_cpu, lowest_mask) || arch_cpu_parked(this_cpu))
 		this_cpu = -1; /* Skip this_cpu opt if not among lowest */
 
 	rcu_read_lock();
@@ -1862,7 +1877,7 @@ static int find_lowest_rq(struct task_struct *task)
 			best_cpu = cpumask_any_and_distribute(lowest_mask,
 							      sched_domain_span(sd));
-			if (best_cpu < nr_cpu_ids) {
+			if (best_cpu < nr_cpu_ids && !arch_cpu_parked(best_cpu)) {
 				rcu_read_unlock();
 				return best_cpu;
 			}
 		}
@@ -1879,7 +1894,7 @@ static int find_lowest_rq(struct task_struct *task)
 		return this_cpu;
 
 	cpu = cpumask_any_distribute(lowest_mask);
-	if (cpu < nr_cpu_ids)
+	if (cpu < nr_cpu_ids && !arch_cpu_parked(cpu))
 		return cpu;
 
 	return -1;
Meanwhile, I will continue looking at the code to understand it better.
>
>>> - ext: Probably affected as well. Needs some conceptual
>>> thought first.
>>> - raciness: Right now, there are no synchronization efforts. It
>>> needs
>>> to be considered whether those might be necessary or if
>>> it is alright that the parked-state of a CPU might
>>> change
>>> during load-balancing.
>>>
>>> Patches apply to tip:sched/core
>>>
>>> The s390 patch serves as a simplified implementation example.
>>
>>
>> Gave it a try on powerpc with the debugfs file. it works for
>> sched_normal tasks.
>>
>
> That's great to hear!
>
>>>
>>> Tobias Huschle (3):
>>> sched/fair: introduce new scheduler group type group_parked
>>> sched/fair: adapt scheduler group weight and capacity for parked CPUs
>>> s390/topology: Add initial implementation for selection of parked
>>> CPUs
>>>
>>> arch/s390/include/asm/smp.h | 2 +
>>> arch/s390/kernel/smp.c | 5 ++
>>> include/linux/sched/topology.h | 19 ++++++
>>> kernel/sched/core.c | 13 ++++-
>>> kernel/sched/fair.c | 104 ++++++++++++++++++++++++++++-----
>>> kernel/sched/syscalls.c | 3 +
>>> 6 files changed, 130 insertions(+), 16 deletions(-)
>>>
>>
>