[v4] sched: Fix cluster scheduling in the presence of asymmetric capacity

[PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

Posted by Ricardo Neri 3 days, 22 hours ago

Hi,

This is v4 of the series. The most important change in this version is a
pre-work patch to fix a bug that surfaced after the SMT-aware asymmetric
CPU capacity patchset from Andrea and Prateek [1] was applied. This led me
to do more testing. Please read the changelog for details.

Cluster scheduling aims to maximize performance by spreading load across
clusters of CPUs that share mid-level resources [2]. It works well on
uniform systems, but it breaks down on topologies with big and small
cores arranged in clusters. As a result, it fails on several generations
of Intel processors, both already shipped and upcoming.

Consider the topology below of big (B) cores and clusters of small (s)
cores.
         ------   ------
         | B  |   | B  |   -----------------   -----------------
         |    |   |    |   | s | s | s | s |   | s | s | s | s |
         ------   ------   -----------------   -----------------
         | L2 |   | L2 |   |      L2       |   |       L2      |
         -------------------------------------------------------
         |                          L3                         |
         -------------------------------------------------------

On a partially busy system (one with idle CPUs; busy CPUs have one task
each), scheduling for asymmetric capacity ensures that misfit tasks land on
the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
be evenly spread among the small-CPU clusters. Today, this does not
happen.

Several issues in the load balancer prevent a small CPU in one cluster
from pulling tasks from another:

 a) update_sd_pick_busiest() may select a fully_busy group with higher
    per-CPU capacity as the busiest, preventing a subsequent fully_busy
    group of equal capacity from being correctly selected.
 b) Misfit-load statistics are used to identify tasks that would benefit
    from migrating to bigger CPUs. Accounting misfit load is pointless if
    the destination CPU is equally small, and it also blocks balancing
    between clusters.
 c) Due to b), groups that are truly has_spare or fully_busy get
    misclassified as misfit_task. update_sd_pick_busiest() then skips
    them, since a small destination CPU cannot help with misfit tasks.
 d) Once a busiest group has been identified, sched_balance_find_src_rq()
    will refuse to migrate tasks to CPUs of equal capacity, even when
    doing so is precisely what is required to balance small-CPU clusters.
 e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
    asymmetric capacity, preventing the balancer from equalizing load
    across sibling small-core clusters.

Together, these issues prevent cluster-level balancing on systems with
asymmetric CPU capacity.

This series addresses each problem and restores the intended behavior.
Details, rationale, and code changes are explained in each patch.

I tested these patches on Alder Lake, which has both SMT Pcores and
clusters of Ecores. I tested with SMT both disabled and enabled. I also
tested on Lunar Lake and Panther Lake, which have an Ecore cluster not
connected to the L3 cache. I repeated the same experiment with
CONFIG_SCHED_CLUSTER disabled. The load balancer behaves as expected.

Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@nvidia.com/ [1]
Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [2]

Changes in v4:
- Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
cores with more than one busy sibling.
- Patch 2 (pre-work): Fixed a bug that would needlessly update
sg_overloaded.
- Patch 5: Reworked logic using a local variable for improved
readability.
- Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
- Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@linux.intel.com

Changes in v3:
- Patch 3: Reverted the inverted runtime capacity check. The inverted
form resulted in migrations to CPUs of slightly lower capacity. Guarded
the check for architectural capacity with the sched_cluster_active
static key.
- Patch 4: Expanded the patch description to explain the behavior of
overloaded groups and low-capacity clusters with spare capacity.
- Added Reviewed-by tags from Christian. Thanks!
- Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com

Changes in v2:
- Patch 1: Rewrote patch description for clarity. Added a note
clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
exclusive. (Tim)
- Patch 2: Fixed a bug where the capacity check inadvertently broke
the mutual exclusion of the sched_reduced_capacity() path. Keep
marking the root domain as overloaded when misfit tasks are present
to allow bigger CPUs to help via newly idle balance. (sashiko)
Fixed the description to state that capacity_greater() looks for
differences of ~5% or more, not 20%. (Christian)
- Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
ignore runtime capacity variability. Inverted the capacity check.
(Christian)
- Patch 4: Reworded the patch description for clarity.
- Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/
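
For readers following the ~5% versus 20% note under Patch 2, the two margins can be sketched as below. This is a Python mirror of the capacity_greater() and fits_capacity() macros as I understand them from kernel/sched/fair.c; the constants 1078 and 1280 are an assumption to verify against the base commit, and the capacity values are purely illustrative.

```python
# Python mirror of two margin checks from kernel/sched/fair.c.
# Assumption: the constants below match the macros at the base commit.

def capacity_greater(cap1: int, cap2: int) -> bool:
    """True when cap1 exceeds cap2 by more than ~5% (1078/1024)."""
    return cap1 * 1024 > cap2 * 1078

def fits_capacity(util: int, cap: int) -> bool:
    """True when util leaves ~20% headroom on a CPU of capacity cap (1280/1024)."""
    return util * 1280 < cap * 1024

# Illustrative capacity values only.
# Two small CPUs of equal capacity: neither is "greater" than the other.
same = capacity_greater(446, 446)
# A big CPU versus a small CPU: clearly greater.
big_vs_small = capacity_greater(1024, 446)
# A difference of about 4% stays inside the margin (treated as similar).
within_margin = capacity_greater(1024, 985)

print(same, big_vs_small, within_margin)
```

The point of the smaller margin is that two CPUs only count as different in capacity when the gap is larger than about 5%, which is a separate question from whether a task's utilization fits on a CPU with 20% headroom.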

---
Ricardo Neri (6):
      sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
      sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
      sched/fair: Check CPU capacity before comparing group types during load balance
      sched/fair: Skip misfit load accounting when the destination CPU cannot help
      sched/fair: Allow load balancing between CPUs of identical capacity
      sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

 include/linux/sched/sd_flags.h |  3 ++-
 kernel/sched/fair.c            | 57 +++++++++++++++++++++++++++++++-----------
 kernel/sched/topology.c        | 14 +++++++++--
 3 files changed, 56 insertions(+), 18 deletions(-)
---
base-commit: 83313bb25a6ace43b0cb5bde881213e6cfb3b046
change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152

Best regards,
--
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

Re: [PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

Posted by Christian Loehle 3 days, 17 hours ago

On 6/8/26 13:57, Ricardo Neri wrote:
> [...]

Since I don't really have an arm64 machine that hits the described case just
right, I tested the series on a synthetic arm64 qemu topology with two
equal-capacity little clusters and one big cluster of capacity 1024.

The guest was booted with QEMU virt, 8 CPUs and a custom dtb. The resulting
topology is:
cluster0: CPUs 0-1, cpu_capacity=446
cluster1: CPUs 2-3, cpu_capacity=446
cluster2: CPUs 4-7, cpu_capacity=1024
The dtb describes the clusters with cpu-map. The test kernel was built with
CONFIG_SCHED_CLUSTER enabled.

I used an rt-app workload with 8 (nr_cpus) SCHED_OTHER tasks.
Each task used the same two phases:
"pinned": {
    "loop": 100,
    "run": 99000,
    "timer": { "ref": "unique", "period": 100000 },
    "cpus": [0, 1, 4, 5, 6, 7]
},
"open": {
    "loop": 100,
    "run": 99000,
    "timer": { "ref": "unique", "period": 100000 },
    "cpus": [0, 1, 2, 3, 4, 5, 6, 7]
}
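
As a rough reading of the two phases above, the nominal duty cycle of each task can be worked out as below. This is a sketch that assumes rt-app's "run" and timer "period" values are both in microseconds; the variable names are mine.

```python
# Values copied from the rt-app phases above; units assumed to be microseconds.
RUN_US = 99_000       # "run": busy time requested per loop iteration
PERIOD_US = 100_000   # "timer" period per loop iteration

# Fraction of each period a task asks to spend running.
duty_cycle = RUN_US / PERIOD_US
print(f"nominal duty cycle: {duty_cycle:.0%}")
```

This is only the requested figure. On the 446-capacity CPUs the busy loop may take longer than on the big CPUs, so the effective utilization there can differ.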

The intent is to first force the workload onto cluster0 plus the big cluster,
leaving the second little cluster unused. Then the affinity mask is opened to
all CPUs. If load balancing across equal-capacity clusters works, CPUs 2-3
should receive a meaningful share of the work (instead of only occasional
migrations).

I counted rt-app sched_switch events per cluster in the open phase. The pass
condition was that cluster1_little receives at least 20% of open-phase rt-app
sched_switch events.

Results over three runs (for the open phases):

mainline:
run0: cluster0 5.7%, cluster1 5.7%, big 88.6%  FAIL
run1: cluster0 5.5%, cluster1 6.2%, big 88.3%  FAIL
run2: cluster0 4.3%, cluster1 4.7%, big 91.0%  FAIL

with this series:
run0: cluster0 38.6%, cluster1 31.4%, big 30.1%  PASS
run1: cluster0 33.2%, cluster1 60.6%, big 6.3%   PASS
run2: cluster0 33.3%, cluster1 60.6%, big 6.1%   PASS

(The pinned phase behaved as expected in all runs: there were no rt-app
sched_switch samples on CPUs 2-3 before the affinity mask was opened.)

For the series (possibly excluding patch 1/6, which targets a different setup):
Tested-by: Christian Loehle <christian.loehle@arm.com>

Re: [PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

Posted by Ricardo Neri 3 days, 8 hours ago

On Mon, Jun 08, 2026 at 06:37:41PM +0100, Christian Loehle wrote:
> [...]

Many thanks for your tests! I have two questions: Do you see similar
results if you spawn fewer tasks than nr_cpus? Perhaps with 6 tasks? They
should continue to be evenly distributed on same-capacity clusters.

Also, are these high-utilization tasks? If yes, the high-capacity cluster
should be fully utilized before any tasks overflow to the lower-capacity
clusters.

Re: [PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

Posted by Christian Loehle 2 days, 15 hours ago

On 6/9/26 04:19, Ricardo Neri wrote:
> On Mon, Jun 08, 2026 at 06:37:41PM +0100, Christian Loehle wrote:
>> On 6/8/26 13:57, Ricardo Neri wrote:
>>> Hi,
>>>
>>> This is v4 of the series. The most important change in this version is a
>>> pre-work patch to fix a bug that surfaced after the SMT-aware asymmetric
>>> CPU capacity patchset from Andrea and Prateek [1] was applied. This led me
>>> to do more testing. Please read the changelog for details.
>>>
>>> Cluster scheduling aims to maximize performance by spreading load across
>>> clusters of CPUs that share mid-level resources [2]. It works well on
>>> uniform systems, but it breaks down on topologies with big and small
>>> cores arranged in clusters. As a result, it fails on several generations
>>> of Intel processors already shipped and upcoming.
>>>
>>> Consider the topology below of big (B) cores and clusters of small (s)
>>> cores.
>>>          ------   ------
>>>          | B  |   | B  |   -----------------   -----------------
>>>          |    |   |    |   | s | s | s | s |   | s | s | s | s |
>>>          ------   ------   -----------------   -----------------
>>>          | L2 |   | L2 |   |      L2       |   |       L2      |
>>>          -------------------------------------------------------
>>>          |                          L3                         |
>>>          -------------------------------------------------------
>>>
>>> On a partially busy system (one with idle CPUs; busy CPUs have one task
>>> each), scheduling for asymmetric capacity ensures that misfit tasks land on
>>> the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
>>> When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
>>> be evenly spread among the small-CPU clusters. Today, this does not
>>> happen.
>>>
>>> Several issues in the load balancer prevent a small CPU in one cluster
>>> from pulling tasks from another:
>>>
>>>  a) update_sd_pick_busiest() may select a fully_busy group with higher
>>>     per-CPU capacity as the busiest, preventing a subsequent fully_busy
>>>     group of equal capacity from being correctly selected.
>>>  b) Misfit-load statistics are used to identify tasks that would benefit
>>>     from migrating to bigger CPUs. Accounting misfit load is pointless if
>>>     the destination CPU is equally small, and it also blocks balancing
>>>     between clusters.
>>>  c) Due to b), groups that are truly has_spare or fully_busy get
>>>     misclassified as misfit_task. update_sd_pick_busiest() then skips
>>>     them, since a small destination CPU cannot help with misfit tasks.
>>>  d) Once a busiest group has been identified, sched_balance_find_src_rq()
>>>     will refuse to migrate tasks to CPUs of equal capacity, even when
>>>     doing so is precisely what is required to balance small-CPU clusters.
>>>  e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
>>>     asymmetric capacity, preventing the balancer from equalizing load
>>>     across sibling small-core clusters.
>>>
>>> Together, these issues prevent cluster-level balancing on systems with
>>> asymmetric CPU capacity.
>>>
>>> This series addresses each problem and restores the intended behavior.
>>> Details, rationale, and code changes are explained in each patch.
>>>
>>> I tested these patches on Alder Lake, which has both SMT Pcores and
>>> clusters of Ecores. I tested with SMT both disabled and enabled. I also
>>> tested on Lunar Lake and Panther Lake, which have an Ecore cluster not
>>> connected to the L3 cache. I repeated the same experiment with
>>> CONFIG_SCHED_CLUSTER disabled. The load balancer behaves as expected.
>>>
>>> Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@nvidia.com/ [1]
>>> Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [2]
>>>
>>> Changes in v4:
>>>  - Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
>>>    cores with more than one busy sibling.
>>>  - Patch 2 (pre-work): Fixed a bug that would needlessly update
>>>    sg_overloaded.
>>>  - Patch 5: Reworked logic using a local variable for improved
>>>    readability.
>>>  - Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
>>>  - Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@linux.intel.com
>>>
>>> Changes in v3:
>>>  - Patch 3: Reverted the inverted runtime capacity check. The inverted
>>>    form resulted in migrations to CPUs of slightly lower capacity. Guarded
>>>    the check for architectural capacity with the sched_cluster_active
>>>    static key.
>>>  - Patch 4: Expanded the patch description to explain the behavior of
>>>    overloaded groups and low-capacity clusters with spare capacity.
>>>  - Added Reviewed-by tags from Christian. Thanks!
>>>  - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com
>>>
>>> Changes in v2:
>>>  - Patch 1: Rewrote patch description for clarity. Added a note
>>>    clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
>>>    exclusive. (Tim)
>>>  - Patch 2: Fixed a bug where the capacity check inadvertently broke
>>>    the mutual exclusion of the sched_reduced_capacity() path. Keep
>>>    marking the root domain as overloaded when misfit tasks are present
>>>    to allow bigger CPUs to help via newly idle balance. (sashiko)
>>>    Fixed the description to state that capacity_greater() looks for
>>>    differences of ~5% or more, not 20%. (Christian)
>>>  - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
>>>    ignore runtime capacity variability. Inverted the capacity check.
>>>    (Christian)
>>>  - Patch 4: Reworded the patch description for clarity.
>>>  - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/
>>>
>>> ---
>>> Ricardo Neri (6):
>>>       sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
>>>       sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
>>>       sched/fair: Check CPU capacity before comparing group types during load balance
>>>       sched/fair: Skip misfit load accounting when the destination CPU cannot help
>>>       sched/fair: Allow load balancing between CPUs of identical capacity
>>>       sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
>>>
>>>  include/linux/sched/sd_flags.h |  3 ++-
>>>  kernel/sched/fair.c            | 57 +++++++++++++++++++++++++++++++-----------
>>>  kernel/sched/topology.c        | 14 +++++++++--
>>>  3 files changed, 56 insertions(+), 18 deletions(-)
>>> ---
>>> base-commit: 83313bb25a6ace43b0cb5bde881213e6cfb3b046
>>> change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152
>>>
>>> Best regards,
>>
>> Since I don't really have an arm64 machine that hits the described case just
>> right, I tested the series on a synthetic arm64 qemu topology with two
>> equal-capacity little clusters and one 1024 cluster.
>>
>> The guest was booted with QEMU virt, 8 CPUs and a custom dtb. The resulting
>> topology is:
>> cluster0: CPUs 0-1, cpu_capacity=446
>> cluster1: CPUs 2-3, cpu_capacity=446
>> cluster2: CPUs 4-7, cpu_capacity=1024
>> The dtb describes the clusters with cpu-map. The test kernel was built with
>> CONFIG_SCHED_CLUSTER enabled.
>>
>> I used an rt-app workload with 8 (nr_cpus) SCHED_OTHER tasks.
>> Each task used the same two phases:
>> "pinned": {
>>     "loop": 100,
>>     "run": 99000,
>>     "timer": { "ref": "unique", "period": 100000 },
>>     "cpus": [0, 1, 4, 5, 6, 7]
>> },
>> "open": {
>>     "loop": 100,
>>     "run": 99000,
>>     "timer": { "ref": "unique", "period": 100000 },
>>     "cpus": [0, 1, 2, 3, 4, 5, 6, 7]
>> }
>>
>> The intent is to first force the workload onto cluster0 plus the big cluster,
>> leaving the second little cluster unused. Then the affinity mask is opened to
>> all CPUs. If load balancing across equal-capacity clusters works, CPUs 2-3
>> should receive a meaningful share of the work (instead of only occasional
>> migrations).
>>
>> I counted rt-app sched_switch events per cluster in the open phase. The pass
>> condition was that cluster1_little receives at least 20% of open-phase rt-app
>> sched_switch events.
>>
>> Results over three runs (for the open phases):
>>
>> mainline:
>> run0: cluster0 5.7%, cluster1 5.7%, big 88.6%  FAIL
>> run1: cluster0 5.5%, cluster1 6.2%, big 88.3%  FAIL
>> run2: cluster0 4.3%, cluster1 4.7%, big 91.0%  FAIL
>>
>> with this series:
>> run0: cluster0 38.6%, cluster1 31.4%, big 30.1%  PASS
>> run1: cluster0 33.2%, cluster1 60.6%, big 6.3%   PASS
>> run2: cluster0 33.3%, cluster1 60.6%, big 6.1%   PASS
>>
>> (The pinned phase behaved as expected in all runs: there were no rt-app
>> sched_switch samples on CPUs 2-3 before the affinity mask was opened.)
>>
>> For the series (patch 1/6 targets a different setup, so possibly excluding that one):
>> Tested-by: Christian Loehle <christian.loehle@arm.com>
> 
> Many thanks for your tests! I have two questions: Do you see similar
> results if you spawn fewer tasks than nr_cpus? Perhaps with 6 tasks? They
> should continue to be evenly distributed on same-capacity clusters.

With 6 tasks:

pinned phase first timestamp: 1850.75
open phase first timestamp: 1879.02

before_open sched_switch samples: 79
  cpu0: 30
  cpu1: 8
  cpu2: 0
  cpu3: 0
  cpu4: 9
  cpu5: 6
  cpu6: 7
  cpu7: 19
  cluster0_little: 38 (48.1%)
  cluster1_little: 0 (0.0%)
  cluster2_big:    41 (51.9%)

after_open sched_switch samples: 899
  cpu0: 235
  cpu1: 176
  cpu2: 225
  cpu3: 218
  cpu4: 9
  cpu5: 18
  cpu6: 8
  cpu7: 10
  cluster0_little: 411 (45.7%)
  cluster1_little: 443 (49.3%)
  cluster2_big:    45 (5.0%)

after_open sched_migrate_task destination clusters:
  cluster0_little: 380
  cluster1_little: 408
  cluster2_big:    6


runtime evaluation:
before_open runtime: 169.337475s
  cluster0_little: 57.503009s (34.0%)
  cluster1_little: 0.000000s (0.0%)
  cluster2_big:    111.834466s (66.0%)

after_open runtime: 191.293461s
  cluster0_little: 27.926735s (14.6%)
  cluster1_little: 33.729412s (17.6%)
  cluster2_big:    129.637314s (67.8%)

after_open little-to-big rt-app migrations: 2
  1910.977751 rtapp04-4 cpu0 -> cpu7
  1911.011555 rtapp03-3 cpu2 -> cpu5
big-cluster rtapp_task:end count: 6
  first: 1910.975772 last: 1913.225910


FWIW I also mirrored the pinned phase (so pinned to cluster1 first):

switch-count evaluation:
pinned phase first timestamp: 1469.11
open phase first timestamp: 1498.6

before_open sched_switch samples: 61
  cpu0: 0
  cpu1: 0
  cpu2: 6
  cpu3: 13
  cpu4: 24
  cpu5: 6
  cpu6: 6
  cpu7: 6
  cluster0_little: 0 (0.0%)
  cluster1_little: 19 (31.1%)
  cluster2_big:    42 (68.9%)

after_open sched_switch samples: 883
  cpu0: 221
  cpu1: 178
  cpu2: 226
  cpu3: 219
  cpu4: 10
  cpu5: 7
  cpu6: 11
  cpu7: 11
  cluster0_little: 399 (45.2%)
  cluster1_little: 445 (50.4%)
  cluster2_big:    39 (4.4%)

after_open sched_migrate_task destination clusters:
  cluster0_little: 370
  cluster1_little: 403
  cluster2_big:    8

runtime evaluation:
before_open runtime: 171.855925s
  cluster0_little: 0.000000s (0.0%)
  cluster1_little: 58.754673s (34.2%)
  cluster2_big:    113.101252s (65.8%)

after_open runtime: 189.467864s
  cluster0_little: 27.279064s (14.4%)
  cluster1_little: 31.392607s (16.6%)
  cluster2_big:    130.796193s (69.0%)

after_open little-to-big rt-app migrations: 3
  1529.002652 rtapp04-4 cpu0 -> cpu6
  1529.002902 rtapp02-2 cpu3 -> cpu6
  1529.778636 rtapp02-2 cpu0 -> cpu7
big-cluster rtapp_task:end count: 6
  first: 1529.000465 last: 1533.242563
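
The little-to-big migration lists can be obtained by filtering
sched_migrate_task records on their source and destination CPUs. A minimal
sketch, assuming the records were already parsed into
(timestamp, task, src_cpu, dst_cpu) tuples; the tuple layout and the helper
name little_to_big() are hypothetical, and the last sample record is made up
purely to show that same-capacity migrations are filtered out.

```python
# CPU sets of the synthetic topology: two little clusters and one big cluster.
LITTLE_CPUS = {0, 1, 2, 3}
BIG_CPUS = {4, 5, 6, 7}


def little_to_big(migrations):
    """Keep only records whose source is a little CPU and destination a big CPU."""
    return [m for m in migrations if m[2] in LITTLE_CPUS and m[3] in BIG_CPUS]


records = [
    # The three upmigrations reported for the mirrored run.
    (1529.002652, "rtapp04-4", 0, 6),
    (1529.002902, "rtapp02-2", 3, 6),
    (1529.778636, "rtapp02-2", 0, 7),
    # Made-up little-to-little record (NOT from the trace), for illustration.
    (1530.000000, "rtapp01-1", 1, 2),
]

upmigrations = little_to_big(records)
print(f"little-to-big rt-app migrations: {len(upmigrations)}")
for ts, task, src, dst in upmigrations:
    print(f"  {ts:.6f} {task} cpu{src} -> cpu{dst}")
```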


> 
> Also, are these high-utilization tasks? If yes, the high-capacity cluster
> should be fully utilized before any tasks overflow to the lower-capacity
> clusters.

Yes, sorry, I should have mentioned: each of the rt-app tasks above uses 99%
of the capacity of a 1024-capacity CPU, so the expected behavior is that all
CPUs are used. Once the tasks on the 1024-capacity CPUs finish (they should
finish in ~half the time), tasks from either little cluster are upmigrated
to the big CPUs.
I did quickly check if that is the case, which it was, but that part is
definitely more than wonky on qemu (the capacities are just in the dtb; the
CPUs are of course vCPUs, which behave very noisily and with no correlation
to the capacity value).
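
As a back-of-the-envelope check of the "~half the time" estimate, here is a
small sketch of the idealized arithmetic. It assumes that the time needed for
one "run" scales inversely with CPU capacity and that a loop never takes less
than the timer period; both are simplifying assumptions, the function name
phase_duration_s() is hypothetical, and none of this holds for the qemu vCPUs
discussed above.

```python
BIG_CAPACITY = 1024
LITTLE_CAPACITY = 446
RUN_US = 99_000      # "run" value from the rt-app phases
PERIOD_US = 100_000  # timer "period" from the rt-app phases
LOOPS = 100          # "loop" value from the rt-app phases


def phase_duration_s(capacity):
    """Idealized duration of one phase on a CPU of the given capacity."""
    # Assumption: the run time is calibrated for a 1024-capacity CPU and
    # stretches inversely with capacity on smaller CPUs.
    run_us = RUN_US * BIG_CAPACITY / capacity
    # Assumption: a loop lasts at least one period; if the run overshoots
    # the period, the next loop starts right away.
    per_loop_us = max(run_us, PERIOD_US)
    return LOOPS * per_loop_us / 1e6


big_s = phase_duration_s(BIG_CAPACITY)
little_s = phase_duration_s(LITTLE_CAPACITY)
print(f"big: {big_s:.1f}s, little: {little_s:.1f}s, ratio: {big_s / little_s:.2f}")
```

Under these assumptions a phase takes about 10 s on a big CPU and about
22.7 s on a little CPU, a ratio of roughly 0.44, which is in line with the
estimate above.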

Re: [PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

Posted by Ricardo Neri 2 days, 8 hours ago

On Tue, Jun 09, 2026 at 09:09:25PM +0100, Christian Loehle wrote:
> On 6/9/26 04:19, Ricardo Neri wrote:
> > On Mon, Jun 08, 2026 at 06:37:41PM +0100, Christian Loehle wrote:
> >>
> >> Since I don't really have an arm64 machine that hits the described case just
> >> right, I tested the series on a synthetic arm64 qemu topology with two
> >> equal-capacity little clusters and one 1024 cluster.
> >>
> >> The guest was booted with QEMU virt, 8 CPUs and a custom dtb. The resulting
> >> topology is:
> >> cluster0: CPUs 0-1, cpu_capacity=446
> >> cluster1: CPUs 2-3, cpu_capacity=446
> >> cluster2: CPUs 4-7, cpu_capacity=1024
> >> The dtb describes the clusters with cpu-map. The test kernel was built with
> >> CONFIG_SCHED_CLUSTER enabled.
> >>
> >> I used an rt-app workload with 8 (nr_cpus) SCHED_OTHER tasks.
> >> Each task used the same two phases:
> >> "pinned": {
> >>     "loop": 100,
> >>     "run": 99000,
> >>     "timer": { "ref": "unique", "period": 100000 },
> >>     "cpus": [0, 1, 4, 5, 6, 7]
> >> },
> >> "open": {
> >>     "loop": 100,
> >>     "run": 99000,
> >>     "timer": { "ref": "unique", "period": 100000 },
> >>     "cpus": [0, 1, 2, 3, 4, 5, 6, 7]
> >> }
> >>
> >> The intent is to first force the workload onto cluster0 plus the big cluster,
> >> leaving the second little cluster unused. Then the affinity mask is opened to
> >> all CPUs. If load balancing across equal-capacity clusters works, CPUs 2-3
> >> should receive a meaningful share of the work (instead of only occasional
> >> migrations).
> >>
> >> I counted rt-app sched_switch events per cluster in the open phase. The pass
> >> condition was that cluster1_little receives at least 20% of open-phase rt-app
> >> sched_switch events.
> >>
> >> Results over three runs (for the open phases):
> >>
> >> mainline:
> >> run0: cluster0 5.7%, cluster1 5.7%, big 88.6%  FAIL
> >> run1: cluster0 5.5%, cluster1 6.2%, big 88.3%  FAIL
> >> run2: cluster0 4.3%, cluster1 4.7%, big 91.0%  FAIL
> >>
> >> with this series:
> >> run0: cluster0 38.6%, cluster1 31.4%, big 30.1%  PASS
> >> run1: cluster0 33.2%, cluster1 60.6%, big 6.3%   PASS
> >> run2: cluster0 33.3%, cluster1 60.6%, big 6.1%   PASS
> >>
> >> (The pinned phase behaved as expected in all runs: there were no rt-app
> >> sched_switch samples on CPUs 2-3 before the affinity mask was opened.)
> >>
> >> For the series (patch 1/6 targets a different setup, so possibly excluding that one):
> >> Tested-by: Christian Loehle <christian.loehle@arm.com>
> > 
> > Many thanks for your tests! I have two questions: Do you see similar
> > results if you spawn fewer tasks than nr_cpus? Perhaps with 6 tasks? They
> > should continue to be evenly distributed on same-capacity clusters.
> 
> With 6 tasks:
> 
> pinned phase first timestamp: 1850.75
> open phase first timestamp: 1879.02
> 
> before_open sched_switch samples: 79
>   cpu0: 30
>   cpu1: 8
>   cpu2: 0
>   cpu3: 0
>   cpu4: 9
>   cpu5: 6
>   cpu6: 7
>   cpu7: 19
>   cluster0_little: 38 (48.1%)
>   cluster1_little: 0 (0.0%)
>   cluster2_big:    41 (51.9%)
> 
> after_open sched_switch samples: 899
>   cpu0: 235
>   cpu1: 176
>   cpu2: 225
>   cpu3: 218
>   cpu4: 9
>   cpu5: 18
>   cpu6: 8
>   cpu7: 10
>   cluster0_little: 411 (45.7%)
>   cluster1_little: 443 (49.3%)
>   cluster2_big:    45 (5.0%)
> 
> after_open sched_migrate_task destination clusters:
>   cluster0_little: 380
>   cluster1_little: 408
>   cluster2_big:    6
> 
> 
> runtime evaluation:
> before_open runtime: 169.337475s
>   cluster0_little: 57.503009s (34.0%)
>   cluster1_little: 0.000000s (0.0%)
>   cluster2_big:    111.834466s (66.0%)
> 
> after_open runtime: 191.293461s
>   cluster0_little: 27.926735s (14.6%)
>   cluster1_little: 33.729412s (17.6%)
>   cluster2_big:    129.637314s (67.8%)
> 
> after_open little-to-big rt-app migrations: 2
>   1910.977751 rtapp04-4 cpu0 -> cpu7
>   1911.011555 rtapp03-3 cpu2 -> cpu5
> big-cluster rtapp_task:end count: 6
>   first: 1910.975772 last: 1913.225910
> 
> 
> FWIW I also mirrored the pinned phase (so pinned to cluster1 first):
> 
> switch-count evaluation:
> pinned phase first timestamp: 1469.11
> open phase first timestamp: 1498.6
> 
> before_open sched_switch samples: 61
>   cpu0: 0
>   cpu1: 0
>   cpu2: 6
>   cpu3: 13
>   cpu4: 24
>   cpu5: 6
>   cpu6: 6
>   cpu7: 6
>   cluster0_little: 0 (0.0%)
>   cluster1_little: 19 (31.1%)
>   cluster2_big:    42 (68.9%)
> 
> after_open sched_switch samples: 883
>   cpu0: 221
>   cpu1: 178
>   cpu2: 226
>   cpu3: 219
>   cpu4: 10
>   cpu5: 7
>   cpu6: 11
>   cpu7: 11
>   cluster0_little: 399 (45.2%)
>   cluster1_little: 445 (50.4%)
>   cluster2_big:    39 (4.4%)
> 
> after_open sched_migrate_task destination clusters:
>   cluster0_little: 370
>   cluster1_little: 403
>   cluster2_big:    8
> 
> runtime evaluation:
> before_open runtime: 171.855925s
>   cluster0_little: 0.000000s (0.0%)
>   cluster1_little: 58.754673s (34.2%)
>   cluster2_big:    113.101252s (65.8%)
> 
> after_open runtime: 189.467864s
>   cluster0_little: 27.279064s (14.4%)
>   cluster1_little: 31.392607s (16.6%)
>   cluster2_big:    130.796193s (69.0%)
> 
> after_open little-to-big rt-app migrations: 3
>   1529.002652 rtapp04-4 cpu0 -> cpu6
>   1529.002902 rtapp02-2 cpu3 -> cpu6
>   1529.778636 rtapp02-2 cpu0 -> cpu7
> big-cluster rtapp_task:end count: 6
>   first: 1529.000465 last: 1533.242563

Thanks for the experiment and the details! The patchset works as expected
AFAICS.

> 
> 
> > 
> > Also, are these high-utilization tasks? If yes, the high-capacity cluster
> > should be fully utilized before any tasks overflow to the lower-capacity
> > clusters.
> 
> Yes, sorry, I should have mentioned: each of the rt-app tasks above uses 99%
> of the capacity of a 1024-capacity CPU, so the expected behavior is that all
> CPUs are used. Once the tasks on the 1024-capacity CPUs finish (they should
> finish in ~half the time), tasks from either little cluster are upmigrated
> to the big CPUs.
> I did quickly check if that is the case, which it was,

Great! Then it seems I didn't break anything.

> but that part is
> definitely more than wonky on qemu (the capacities are just in the dtb; the
> CPUs are of course vCPUs, which behave very noisily and with no correlation
> to the capacity value).

Indeed! :)