Hi,

This series introduces support for a cluster CPU topology, in addition
to the existing sockets, cores, and threads.

A cluster is a group of cores that share some resources (e.g. a cache)
among themselves, below the LLC. For example, the ARM64 server chip
Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
4 cores. All clusters share the L3 cache, while cores within each
cluster share the L2 cache.

There are also x86 CPU implementations (e.g. Jacobsville) where the L2
cache is shared among a cluster of cores instead of being exclusive to
a single core. On Jacobsville there are 6 clusters of 4 Atom cores,
each cluster sharing a separate L2, and all 24 cores sharing the L3.

The cache affinity of clusters has been shown to improve Linux kernel
scheduling performance, and a patch set [1] has already been posted
which adds a generic sched_domain for clusters and a cluster level to
the arch-neutral CPU topology struct, as below:

struct cpu_topology {
    int thread_id;
    int core_id;
    int cluster_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t cluster_sibling;
    cpumask_t llc_sibling;
};

The kernel documentation [2],
Documentation/devicetree/bindings/cpu/cpu-topology.txt, also defines a
four-level CPU topology hierarchy: socket/cluster/core/thread. In that
binding, a socket node's child nodes must be one or more cluster nodes,
and a cluster node's child nodes must be one or more cluster nodes or
one or more core nodes.

So let's add arch-neutral "-smp, clusters=*" command line support, so
that a future guest OS can make use of the cluster CPU topology for
better scheduling performance. Any architecture that has groups of CPUs
sharing some separate resources (e.g. an L2 cache) internally, below
the LLC, can use this command line parameter to define a VM with a
cluster-level CPU topology.

For ARM machines, a four-level CPU hierarchy can be defined:
sockets/clusters/cores/threads. For PC machines, a five-level CPU
hierarchy can be defined: sockets/dies/clusters/cores/threads.

About this series:
Note that this series was implemented on top of [3] and [4]. Although
they have not been merged into the QEMU mainline yet, it is still
worthwhile to post this series now to present the idea. So an RFC is
sent, and any comments are welcome and appreciated.

Test results:
With the command line -smp 96,sockets=2,clusters=6,cores=4,threads=2,
the VM's CPU topology is described as below.

lscpu:
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
BogoMIPS:            200.00
L1d cache:           unknown size
L1i cache:           unknown size
L2 cache:            unknown size
NUMA node0 CPU(s):   0-95

Topology information about the clusters can also be retrieved:
cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0-7
cat /sys/devices/system/cpu/cpu0/topology/cluster_id: 56
cat /sys/devices/system/cpu/cpu8/topology/cluster_cpus_list: 8-15
cat /sys/devices/system/cpu/cpu8/topology/cluster_id: 316
...
cat /sys/devices/system/cpu/cpu95/topology/cluster_cpus_list: 88-95
cat /sys/devices/system/cpu/cpu95/topology/cluster_id: 2936

Links:
[1] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210319041618.14316-1-song.bao.hua@hisilicon.com/
[2] https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/cpu/cpu-topology.txt
[3] https://patchwork.kernel.org/project/qemu-devel/cover/20210225085627.2263-1-fangying1@huawei.com/
[4] https://patchwork.kernel.org/project/qemu-devel/patch/20201109030452.2197-4-fangying1@huawei.com/

Yanan Wang (6):
  vl.c: Add arch-neutral -smp, clusters=* command line support
  hw/core/machine: Parse cluster cpu topology in smp_parse()
  hw/arm/virt: Parse cluster cpu topology for ARM machines
  hw/i386/pc: Parse cluster cpu topology for PC machines
  hw/arm/virt-acpi-build: Add cluster level for ARM PPTT table
  hw/arm/virt: Add cluster level for ARM device tree

 hw/acpi/aml-build.c         | 11 +++++++++
 hw/arm/virt-acpi-build.c    | 43 ++++++++++++++++++++---------------
 hw/arm/virt.c               | 45 ++++++++++++++++++++++--------------
 hw/core/machine.c           | 32 +++++++++++++++-----------
 hw/i386/pc.c                | 31 +++++++++++++++----------
 include/hw/acpi/aml-build.h |  2 ++
 include/hw/boards.h         |  4 +++-
 qemu-options.hx             | 27 +++++++++++++---------
 softmmu/vl.c                |  3 +++
 9 files changed, 125 insertions(+), 73 deletions(-)

--
2.19.1
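A note on the numbers above: with clusters=6, cores=4, and threads=2,
each cluster spans 4 * 2 = 8 logical CPUs, which matches the
cluster_cpus_list ranges (0-7, 8-15, ...), and each socket has
6 * 4 = 24 cores, matching "Core(s) per socket: 24" in the lscpu output.
The seemingly arbitrary cluster_id values (56, 316, ...) are consistent
with Linux falling back to the PPTT table offset of a hierarchy node
when it carries no valid ACPI processor ID.

The hw/acpi/aml-build.c part of the diffstat suggests a small generic
helper for emitting PPTT Processor Hierarchy Node structures. Below is
a minimal sketch of what such a helper could look like, assuming the
ACPI 6.2 PPTT Type 0 node layout and QEMU's existing
build_append_byte()/build_append_int_noprefix() helpers; the name and
signature are illustrative, not necessarily what the series implements:

/*
 * Sketch: append one PPTT Processor Hierarchy Node (Type 0, ACPI 6.2)
 * with no private resources. One node would be emitted per topology
 * unit (socket/cluster/core/thread), with 'parent' holding the table
 * offset of the enclosing level's node and 'id' the ACPI processor ID
 * (meaningful only when the corresponding bit is set in 'flags').
 */
static void build_processor_hierarchy_node(GArray *tbl, uint32_t flags,
                                           uint32_t parent, uint32_t id)
{
    build_append_byte(tbl, 0);                 /* Type 0 */
    build_append_byte(tbl, 20);                /* Length */
    build_append_int_noprefix(tbl, 0, 2);      /* Reserved */
    build_append_int_noprefix(tbl, flags, 4);  /* Flags */
    build_append_int_noprefix(tbl, parent, 4); /* Parent */
    build_append_int_noprefix(tbl, id, 4);     /* ACPI Processor ID */
    build_append_int_noprefix(tbl, 0, 4);      /* Private resources: none */
}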
On 31/03/21 11:53, Yanan Wang wrote:
> A cluster is a group of cores that share some resources (e.g. a cache)
> among themselves, below the LLC. For example, the ARM64 server chip
> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
> 4 cores. All clusters share the L3 cache, while cores within each
> cluster share the L2 cache.

Is this different from what we already have with "-smp dies"?

Paolo
Hi Paolo,

On 2021/3/31 18:00, Paolo Bonzini wrote:
> On 31/03/21 11:53, Yanan Wang wrote:
>> A cluster is a group of cores that share some resources (e.g. a cache)
>> among themselves, below the LLC. For example, the ARM64 server chip
>> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
>> 4 cores. All clusters share the L3 cache, while cores within each
>> cluster share the L2 cache.
>
> Is this different from what we already have with "-smp dies"?

As far as I know, yes. I think they are two architectural concepts at
different levels. A CPU socket/package can consist of multiple dies, and
each die can consist of multiple clusters, which means dies are parents
of clusters. This kind of CPU hierarchy is common on the ARM64 platform.

Take the ARM64 server chip Kunpeng 920 above as an example: there are 2
sockets in total, 2 dies in each socket, 6 or 8 clusters in each die,
and 4 cores in each cluster. Since it also supports a NUMA architecture,
each NUMA node actually corresponds to one die.

Although there is already a "-smp dies=*" command line parameter for PC
machines, a cluster level can also be added for the x86 architecture if
it proves meaningful, and there will be more work to do in QEMU to make
use of it. And I am sure that ARM needs this level.

If there is something wrong above, please correct me, thanks!

Thanks,
Yanan

> Paolo
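To make that hierarchy concrete: under the semantics proposed in this
series, a Kunpeng 920-like layout (2 sockets, 2 dies per socket, 8
clusters per die, 4 cores per cluster) could be described on a PC
machine as, for example:

-smp 128,sockets=2,dies=2,clusters=8,cores=4,threads=1

(2 * 2 * 8 * 4 * 1 = 128; single-threaded cores are assumed here purely
for illustration). On the ARM side, the new cluster level maps onto the
cpu-map device tree binding defined in [2]; a hypothetical fragment for
one socket containing two clusters of two cores each would look like
(the CPU0..CPU3 phandle labels are placeholders):

cpus {
    cpu-map {
        socket0 {
            cluster0 {
                core0 { cpu = <&CPU0>; };
                core1 { cpu = <&CPU1>; };
            };
            cluster1 {
                core0 { cpu = <&CPU2>; };
                core1 { cpu = <&CPU3>; };
            };
        };
    };
};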
Hi Yanan,

On 3/31/21 11:53 AM, Yanan Wang wrote:
> Hi,
> This series introduces support for a cluster CPU topology, in addition
> to the existing sockets, cores, and threads.
>
> A cluster is a group of cores that share some resources (e.g. a cache)
> among themselves, below the LLC. For example, the ARM64 server chip
> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
> 4 cores. All clusters share the L3 cache, while cores within each
> cluster share the L2 cache.
>
> There are also x86 CPU implementations (e.g. Jacobsville) where the L2
> cache is shared among a cluster of cores instead of being exclusive to
> a single core. On Jacobsville there are 6 clusters of 4 Atom cores,
> each cluster sharing a separate L2, and all 24 cores sharing the L3.
>
> About this series:
> Note that this series was implemented on top of [3] and [4]. Although
> they have not been merged into the QEMU mainline yet, it is still
> worthwhile to post this series now to present the idea. So an RFC is
> sent, and any comments are welcome and appreciated.

At a glance: tests/unit/test-x86-cpuid.c should be adapted to be generic
(but still supporting target-specific sub-tests) and some aarch64 tests
added.

Similarly the ARM PPTT tables tested in tests/qtest/bios-tables-test.c.

Otherwise, the overall series looks good and coherent, but it isn't my
area :)

Regards,

Phil.
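For the PPTT case, a bios-tables-test addition along the following
lines would exercise the new topology levels on the virt machine. This
is only a sketch, modeled on the existing aarch64 ACPI tests in
tests/qtest/bios-tables-test.c (the test_data struct, test_acpi_one(),
and free_test_data()); the ".topology" variant name and the -smp values
are illustrative choices, not part of the series as posted:

/*
 * Sketch: boot the aarch64 virt machine with a clustered -smp layout
 * and diff the resulting ACPI tables (including PPTT) against
 * reference blobs stored with a ".topology" variant suffix.
 */
static void test_acpi_virt_tcg_topology(void)
{
    test_data data = {
        .machine = "virt",
        .variant = ".topology",
        .tcg_only = true,
        .uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
        .uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
        .cd = "tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2",
        .ram_start = 0x40000000ULL,
        .scan_len = 128ULL * 1024 * 1024,
    };

    test_acpi_one("-cpu cortex-a57 "
                  "-smp 8,sockets=1,clusters=2,cores=2,threads=2", &data);
    free_test_data(&data);
}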
Hi Philippe,

On 2021/4/27 17:57, Philippe Mathieu-Daudé wrote:
> Hi Yanan,
>
> On 3/31/21 11:53 AM, Yanan Wang wrote:
>> Hi,
>> This series introduces support for a cluster CPU topology, in addition
>> to the existing sockets, cores, and threads.
>>
>> A cluster is a group of cores that share some resources (e.g. a cache)
>> among themselves, below the LLC. For example, the ARM64 server chip
>> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
>> 4 cores. All clusters share the L3 cache, while cores within each
>> cluster share the L2 cache.
>>
>> There are also x86 CPU implementations (e.g. Jacobsville) where the L2
>> cache is shared among a cluster of cores instead of being exclusive to
>> a single core. On Jacobsville there are 6 clusters of 4 Atom cores,
>> each cluster sharing a separate L2, and all 24 cores sharing the L3.
>>
>> About this series:
>> Note that this series was implemented on top of [3] and [4]. Although
>> they have not been merged into the QEMU mainline yet, it is still
>> worthwhile to post this series now to present the idea. So an RFC is
>> sent, and any comments are welcome and appreciated.
>
> At a glance: tests/unit/test-x86-cpuid.c should be adapted to be generic
> (but still supporting target-specific sub-tests) and some aarch64 tests
> added.
>
> Similarly the ARM PPTT tables tested in tests/qtest/bios-tables-test.c.
>
> Otherwise, the overall series looks good and coherent, but it isn't my
> area :)

Thank you for the reminder about the related tests. :)
I will work on making them cover the new features introduced by this
series.

Thanks,
Yanan
On 4/27/21 1:00 PM, wangyanan (Y) wrote:
> On 2021/4/27 17:57, Philippe Mathieu-Daudé wrote:
>> On 3/31/21 11:53 AM, Yanan Wang wrote:
>>> Hi,
>>> This series introduces support for a cluster CPU topology, in addition
>>> to the existing sockets, cores, and threads.
>>>
>>> A cluster is a group of cores that share some resources (e.g. a cache)
>>> among themselves, below the LLC. For example, the ARM64 server chip
>>> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
>>> 4 cores. All clusters share the L3 cache, while cores within each
>>> cluster share the L2 cache.
>>>
>>> There are also x86 CPU implementations (e.g. Jacobsville) where the L2
>>> cache is shared among a cluster of cores instead of being exclusive to
>>> a single core. On Jacobsville there are 6 clusters of 4 Atom cores,
>>> each cluster sharing a separate L2, and all 24 cores sharing the L3.
>>>
>>> About this series:
>>> Note that this series was implemented on top of [3] and [4]. Although
>>> they have not been merged into the QEMU mainline yet, it is still
>>> worthwhile to post this series now to present the idea. So an RFC is
>>> sent, and any comments are welcome and appreciated.
>> At a glance: tests/unit/test-x86-cpuid.c should be adapted to be generic
>> (but still supporting target-specific sub-tests) and some aarch64 tests
>> added.
>>
>> Similarly the ARM PPTT tables tested in tests/qtest/bios-tables-test.c.
>>
>> Otherwise, the overall series looks good and coherent, but it isn't my
>> area :)
> Thank you for the reminder about the related tests. :)
> I will work on making them cover the new features introduced by this
> series.

BTW if after 4 weeks and 2 pings nobody has sent negative feedback or
NACKed your series, it usually means the community is not against your
proposal, but has some doubts about whether this feature is necessary or
well designed. Tests help to show your work is safe, as it doesn't break
anything. You might need to better explain why this feature is needed,
and what the limitations of what is currently possible are.

OTOH QEMU has been in "feature freeze" for the next v6.0 release for the
same amount of time, so maybe the maintainers/reviewers were busy with
bugs and still have your series on their TODO lists.

Regards,

Phil.
On 2021/4/27 20:08, Philippe Mathieu-Daudé wrote:
> On 4/27/21 1:00 PM, wangyanan (Y) wrote:
>> On 2021/4/27 17:57, Philippe Mathieu-Daudé wrote:
>>> On 3/31/21 11:53 AM, Yanan Wang wrote:
>>>> Hi,
>>>> This series introduces support for a cluster CPU topology, in addition
>>>> to the existing sockets, cores, and threads.
>>>>
>>>> A cluster is a group of cores that share some resources (e.g. a cache)
>>>> among themselves, below the LLC. For example, the ARM64 server chip
>>>> Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has
>>>> 4 cores. All clusters share the L3 cache, while cores within each
>>>> cluster share the L2 cache.
>>>>
>>>> There are also x86 CPU implementations (e.g. Jacobsville) where the L2
>>>> cache is shared among a cluster of cores instead of being exclusive to
>>>> a single core. On Jacobsville there are 6 clusters of 4 Atom cores,
>>>> each cluster sharing a separate L2, and all 24 cores sharing the L3.
>>>>
>>>> About this series:
>>>> Note that this series was implemented on top of [3] and [4]. Although
>>>> they have not been merged into the QEMU mainline yet, it is still
>>>> worthwhile to post this series now to present the idea. So an RFC is
>>>> sent, and any comments are welcome and appreciated.
>>> At a glance: tests/unit/test-x86-cpuid.c should be adapted to be generic
>>> (but still supporting target-specific sub-tests) and some aarch64 tests
>>> added.
>>>
>>> Similarly the ARM PPTT tables tested in tests/qtest/bios-tables-test.c.
>>>
>>> Otherwise, the overall series looks good and coherent, but it isn't my
>>> area :)
>> Thank you for the reminder about the related tests. :)
>> I will work on making them cover the new features introduced by this
>> series.
> BTW if after 4 weeks and 2 pings nobody has sent negative feedback or
> NACKed your series, it usually means the community is not against your
> proposal, but has some doubts about whether this feature is necessary or
> well designed. Tests help to show your work is safe, as it doesn't break
> anything. You might need to better explain why this feature is needed,
> and what the limitations of what is currently possible are.
>
> OTOH QEMU has been in "feature freeze" for the next v6.0 release for the
> same amount of time, so maybe the maintainers/reviewers were busy with
> bugs and still have your series on their TODO lists.

I fully understand your point, and thanks for the explanations.
I will just patiently wait for some feedback and, at the same time,
refine the solution with some convincing tests for the next version.

Thanks,
Yanan

> Regards,
>
> Phil.