[PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines

Gavin Shan posted 4 patches 1 year, 2 months ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20230223081401.248835-1-gshan@redhat.com
Maintainers: Radoslaw Biernacki <rad@semihalf.com>, Peter Maydell <peter.maydell@linaro.org>, Leif Lindholm <quic_llindhol@quicinc.com>, Eduardo Habkost <eduardo@habkost.net>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Yanan Wang <wangyanan55@huawei.com>, Palmer Dabbelt <palmer@dabbelt.com>, Alistair Francis <alistair.francis@wdc.com>, Bin Meng <bin.meng@windriver.com>, Thomas Huth <thuth@redhat.com>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
[PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
For the arm64 and RISC-V architectures, the driver (drivers/base/arch_topology.c)
is used to populate the CPU topology in the Linux guest. It requires that the
CPUs in one socket don't span multiple NUMA nodes. Otherwise, the Linux
scheduling domains can't be sorted out, as the following warning message
indicates. To avoid this unexpected confusion, this series attempts to
reject such insane configurations.

   -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
   -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
   -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
   -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
 
   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
   Modules linked in:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
   pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
   pc : build_sched_domains+0x284/0x910
   lr : build_sched_domains+0x184/0x910
   sp : ffff80000804bd50
   x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
   x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
   x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
   x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
   x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
   x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
   x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
   x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
   x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
   x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
   Call trace:
    build_sched_domains+0x284/0x910
    sched_init_domains+0xac/0xe0
    sched_init_smp+0x48/0xc8
    kernel_init_freeable+0x140/0x1ac
    kernel_init+0x28/0x140
    ret_from_fork+0x10/0x20

PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary
PATCH[2] Validate the configuration if required in the NUMA subsystem
PATCH[3] Enable the validation for aarch64 machines
PATCH[4] Enable the validation for RiscV machines
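
A rough sketch of the check PATCH[2] adds in the generic path (the helper
name, its call site and the error message below are illustrative only, not
the actual patch; it assumes ms->possible_cpus has already been populated
via mc->possible_cpu_arch_ids(ms)):

  #include "qemu/osdep.h"
  #include "hw/boards.h"
  #include "qapi/error.h"

  /* Reject -numa mappings where one socket spans multiple NUMA nodes. */
  static void validate_socket_numa_boundary(MachineState *ms, Error **errp)
  {
      const CPUArchIdList *cpus = ms->possible_cpus;
      int i, j;

      for (i = 0; i < cpus->len; i++) {
          for (j = i + 1; j < cpus->len; j++) {
              const CpuInstanceProperties *a = &cpus->cpus[i].props;
              const CpuInstanceProperties *b = &cpus->cpus[j].props;

              /* Two CPUs of one socket must land in the same NUMA node */
              if (a->has_socket_id && b->has_socket_id &&
                  a->has_node_id && b->has_node_id &&
                  a->socket_id == b->socket_id &&
                  a->node_id != b->node_id) {
                  error_setg(errp,
                             "CPUs of socket %" PRId64 " span NUMA nodes "
                             "%" PRId64 " and %" PRId64,
                             a->socket_id, a->node_id, b->node_id);
                  return;
              }
          }
      }
  }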

v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html

Changelog
=========
v2:
  * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
  * Add helper set_numa_socket_boundary() and validate the
    boundary in the generic path                              (Philippe)

Gavin Shan (4):
  qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
  numa: Validate socket and NUMA node boundary if required
  hw/arm: Validate socket and NUMA node boundary
  hw/riscv: Validate socket and NUMA node boundary

 hw/arm/sbsa-ref.c       |  2 ++
 hw/arm/virt.c           |  2 ++
 hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
 hw/core/numa.c          |  7 +++++++
 hw/riscv/spike.c        |  1 +
 hw/riscv/virt.c         |  1 +
 include/sysemu/numa.h   |  4 ++++
 tests/qtest/numa-test.c | 13 ++++++++++---
 8 files changed, 61 insertions(+), 3 deletions(-)

-- 
2.23.0
Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Daniel P. Berrangé 1 year, 2 months ago
On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.
> 
>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \

This is somewhat odd as a config, because core 2 is in socket 0
and core 3 is in socket 1, so it wouldn't make much conceptual
sense to have them in the same NUMA node, while other cores within
the same socket are in different NUMA nodes. Basically the split
of NUMA nodes is not aligned with any level in the topology.

This series, however, also rejects configurations that I would
normally consider to be reasonable. I've not tested Linux kernel
behaviour, but as a user I would expect to be able to
ask for:

    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
    -numa node,nodeid=0,cpus=0,memdev=ram0                \
    -numa node,nodeid=1,cpus=1,memdev=ram1                \
    -numa node,nodeid=2,cpus=2,memdev=ram2                \
    -numa node,nodeid=3,cpus=3,memdev=ram3                \
    -numa node,nodeid=4,cpus=4,memdev=ram4                \
    -numa node,nodeid=5,cpus=5,memdev=ram5                \

ie, every core gets its own NUMA node

Or to ask for every cluster as a numa node:

    -smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \
    -numa node,nodeid=0,cpus=0,memdev=ram0                \
    -numa node,nodeid=1,cpus=1,memdev=ram1                \
    -numa node,nodeid=2,cpus=2,memdev=ram2                \
    -numa node,nodeid=3,cpus=3,memdev=ram3                \
    -numa node,nodeid=4,cpus=4,memdev=ram4                \
    -numa node,nodeid=5,cpus=5,memdev=ram5                \

In both cases the NUMA split is aligned with a given level
in the topology, which was not the case with your example.

Rejecting these feels overly strict to me, and may risk breaking
existing valid deployments, unless we can demonstrate those
scenarios were unambiguously already broken ?

If there was something in the hardware specs that requires this,
then it is more valid to do than if it is merely a specific guest
kernel limitation that might be fixed any day.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
On 2/23/23 11:57 PM, Daniel P. Berrangé wrote:
> On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>>
>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> 
> This is somewhat odd as a config, because core 2 is in socket 0
> and core 3 is in socket 1, so it wouldn't make much conceptual
> sense to have them in the same NUMA node, while other cores within
> the same socket are in different NUMA nodes. Basically the split
> of NUMA nodes is not aligned with any level in the topology.
> 
> This series, however, also rejects configurations that I would
> normally consider to be reasonable. I've not tested linux kernel
> behaviour though, but as a user I would expect to be able to
> ask for:
> 
>      -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>      -numa node,nodeid=0,cpus=0,memdev=ram0                \
>      -numa node,nodeid=1,cpus=1,memdev=ram1                \
>      -numa node,nodeid=2,cpus=2,memdev=ram2                \
>      -numa node,nodeid=3,cpus=3,memdev=ram3                \
>      -numa node,nodeid=4,cpus=4,memdev=ram4                \
>      -numa node,nodeid=5,cpus=5,memdev=ram5                \
> 
> ie, every core gets its own NUMA node
> 

It doesn't work for the Linux guest either. As the following warning message
indicates, the multicore (MC) domain isn't a subset of the DIE (cluster or
socket) domain. For example, the MC domain is 0-2 while the DIE domain is 0
for CPU-0.

[    0.023486] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0-2 llc=0    // parsed from ACPI PPTT
[    0.023490] CPU-1: 36,56,1,-1 thread=1  core=0-2  cluster=0-2 llc=1
[    0.023492] CPU-2: 36,56,2,-1 thread=2  core=0-2  cluster=0-2 llc=2
[    0.023494] CPU-3: 136,156,3,-1 thread=3  core=3-5  cluster=3-5 llc=3
[    0.023495] CPU-4: 136,156,4,-1 thread=4  core=3-5  cluster=3-5 llc=4
[    0.023497] CPU-5: 136,156,5,-1 thread=5  core=3-5  cluster=3-5 llc=5
[    0.023499] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0-2  DIE=0  CPU-OF-NODE=0      // Seen by scheduling domain
[    0.023501] CPU-1: SMT=1  CLUSTER=1  MULTICORE=0-2  DIE=1  CPU-OF-NODE=1
[    0.023503] CPU-2: SMT=2  CLUSTER=2  MULTICORE=0-2  DIE=2  CPU-OF-NODE=2
[    0.023504] CPU-3: SMT=3  CLUSTER=3  MULTICORE=3-5  DIE=3  CPU-OF-NODE=3
[    0.023506] CPU-4: SMT=4  CLUSTER=4  MULTICORE=3-5  DIE=4  CPU-OF-NODE=4
[    0.023508] CPU-5: SMT=5  CLUSTER=5  MULTICORE=3-5  DIE=5  CPU-OF_NODE=5
         :
[    0.023555] BUG: arch topology borken
[    0.023556]      the MC domain not a subset of the DIE domain

NOTE that DIE and CPU-OF-NODE are the same since they're both returned by
'cpumask_of_node(cpu_to_node(cpu))'.
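
For reference, the sanity check behind that "BUG: arch topology borken" splat
boils down to a cpumask subset test between adjacent scheduling levels. A
simplified illustration (not the actual kernel code) of what
build_sched_domains() effectively verifies, with the MC span passed in by
the caller:

  #include <linux/cpumask.h>
  #include <linux/topology.h>
  #include <linux/printk.h>

  /* A CPU's MC (multicore) span must fit inside its DIE/node span. */
  static bool mc_subset_of_die(int cpu, const struct cpumask *mc_span)
  {
      const struct cpumask *die_span = cpumask_of_node(cpu_to_node(cpu));

      if (!cpumask_subset(mc_span, die_span)) {
          pr_warn("CPU%d: MC %*pbl is not a subset of DIE %*pbl\n",
                  cpu, cpumask_pr_args(mc_span), cpumask_pr_args(die_span));
          return false;
      }
      return true;
  }

With the per-core NUMA nodes above, CPU-0 has MC=0-2 but DIE=0, so the test
fails.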


> Or to aask for every cluster as a numa node:
> 
>      -smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \
>      -numa node,nodeid=0,cpus=0,memdev=ram0                \
>      -numa node,nodeid=1,cpus=1,memdev=ram1                \
>      -numa node,nodeid=2,cpus=2,memdev=ram2                \
>      -numa node,nodeid=3,cpus=3,memdev=ram3                \
>      -numa node,nodeid=4,cpus=4,memdev=ram4                \
>      -numa node,nodeid=5,cpus=5,memdev=ram5                \
> 

This case works fine for the Linux guest.

[    0.024505] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0 llc=0            // parsed from ACPI PPTT
[    0.024509] CPU-1: 36,96,1,-1 thread=1  core=0-2  cluster=1 llc=1
[    0.024511] CPU-2: 36,136,2,-1 thread=2  core=0-2  cluster=2 llc=2
[    0.024512] CPU-3: 176,196,3,-1 thread=3  core=3-5  cluster=3 llc=3
[    0.024514] CPU-4: 176,236,4,-1 thread=4  core=3-5  cluster=4 llc=4
[    0.024515] CPU-5: 176,276,5,-1 thread=5  core=3-5  cluster=5 llc=5
[    0.024518] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0  DIE=0  CPU-OF-NODE=0      // Seen by scheduling domain
[    0.024519] CPU-1: SMT=1  CLUSTER=1  MULTICORE=1  DIE=1  CPU-OF-NODE=1
[    0.024521] CPU-2: SMT=2  CLUSTER=2  MULTICORE=2  DIE=2  CPU-OF-NODE=2
[    0.024522] CPU-3: SMT=3  CLUSTER=3  MULTICORE=3  DIE=3  CPU-OF-NODE=3
[    0.024524] CPU-4: SMT=4  CLUSTER=4  MULTICORE=4  DIE=4  CPU-OF-NODE=4
[    0.024525] CPU-5: SMT=5  CLUSTER=5  MULTICORE=5  DIE=5  CPU-OF-NODE=5


> In both cases the NUMA split is aligned with a given level
> in the topology, which was not the case with your example.
> 
> Rejecting these feels overly strict to me, and may risk breaking
> existing valid deployments, unless we can demonstrate those
> scenarios were unambiguously already broken ?
> 
> If there was something in the hardware specs that requires
> this, then it is more valid to do, than if it is merely an
> specific guest kernel limitation that might be fixed any day.
> 

Yes, I agree that the socket-to-NUMA-node boundary is strict. However, it
doesn't seem sensible to split the CPUs in one cluster, or the CPUs in one
core, across different NUMA nodes in a baremetal environment. I think we
probably need to prevent these two cases, meaning two clusters in one socket
are still allowed to be associated with different NUMA nodes.

I failed to find accurate information about the relation among socket/cluster/core
in the specs. As I understand it, the CPUs (threads) in one core share the L2 cache
and the cores in one cluster share the L3 cache, while each thread has its own L1
cache. The L3 cache usually corresponds to a NUMA node. I may be totally wrong here.

Thanks,
Gavin





Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Andrew Jones 1 year, 2 months ago
On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.

Hi Gavin,

I'm not sure 'insane' is the right word, maybe 'irregular'. I've never
seen, and don't really expect to find, specifications stating that arm
and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as
QEMU machine types can choose certain configurations to support, if
everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1
mappings, then why not. However, I suggest stating that this is simply a
choice being made for these boards, since the expectation of needing
"irregular" configurations is low. We should not be trying to use Linux
assumptions as rationale, since other OSes may cope better with irregular
configurations.

Also, I suggest adding a comment to the boards that adopt this
restriction, explaining that it's only a platform choice, not
something specified by the architecture.

Thanks,
drew

> 
>    -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>    -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>    -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>  
>    ------------[ cut here ]------------
>    WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>    Modules linked in:
>    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>    pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>    pc : build_sched_domains+0x284/0x910
>    lr : build_sched_domains+0x184/0x910
>    sp : ffff80000804bd50
>    x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>    x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>    x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>    x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>    x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>    x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>    x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>    x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>    x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>    x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>    Call trace:
>     build_sched_domains+0x284/0x910
>     sched_init_domains+0xac/0xe0
>     sched_init_smp+0x48/0xc8
>     kernel_init_freeable+0x140/0x1ac
>     kernel_init+0x28/0x140
>     ret_from_fork+0x10/0x20
> 
> PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary
> PATCH[2] Validate the configuration if required in the NUMA subsystem
> PATCH[3] Enable the validation for aarch64 machines
> PATCH[4] Enable the validation for RiscV machines
> 
> v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html
> 
> Changelog
> =========
> v2:
>   * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
>   * Add helper set_numa_socket_boundary() and validate the
>     boundary in the generic path                              (Philippe)
> 
> Gavin Shan (4):
>   qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
>   numa: Validate socket and NUMA node boundary if required
>   hw/arm: Validate socket and NUMA node boundary
>   hw/riscv: Validate socket and NUMA node boundary
> 
>  hw/arm/sbsa-ref.c       |  2 ++
>  hw/arm/virt.c           |  2 ++
>  hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
>  hw/core/numa.c          |  7 +++++++
>  hw/riscv/spike.c        |  1 +
>  hw/riscv/virt.c         |  1 +
>  include/sysemu/numa.h   |  4 ++++
>  tests/qtest/numa-test.c | 13 ++++++++++---
>  8 files changed, 61 insertions(+), 3 deletions(-)
> 
> -- 
> 2.23.0
> 
>
Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
Hi Drew,

On 2/23/23 11:25 PM, Andrew Jones wrote:
> On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
> 
> I'm not sure 'insane' is the right word, maybe 'irregular'. I've never
> seen, and don't really expect to find, specifications stating that arm
> and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as
> QEMU machine types can choose certain configurations to support, if
> everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1
> mappings, then why not. However, I suggest stating that this is simply a
> choice being made for these boards, since the expectation of needing
> "irregular" configurations is low. We should not be trying to use Linux
> assumptions as rationale, since other OSes may cope better with irregular
> configurations.
> 
> Also, I suggest adding a comment to the boards, which adopt this
> restriction, which explains that it's only a platform choice, not
> something specified by the architecture.
> 

[...]

Thanks for your comments. Yes, "irregular" is definitely a better word. I failed
to find a statement that ARM64 requires a 1:1 mapping between socket and NUMA
node, which makes it hard to tell if the restriction is required at the
architectural level. However, the Linux driver (drivers/base/arch_topology.c)
considers both the CPU-to-NUMA association and the CPU topology from the ACPI
PPTT when the CPU set in the core group is calculated. It means the
CPU-to-NUMA association can affect the CPU topology seen by the Linux scheduler.

With the restriction of a 1:1 mapping between socket and NUMA node, or between
cluster and NUMA node as I mentioned in the reply to Daniel P. Berrangé <berrange@redhat.com>,
existing configurations for other OSes can become invalid.

Yes, a comment mentioning that it's a platform choice, not an architectural
requirement, definitely helps to understand the context.
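
For example, something along these lines in the board code (the flag name
below is only a placeholder for whatever the next revision ends up adding):

  /*
   * Sketch of an excerpt from hw/arm/virt.c (similarly sbsa-ref.c and the
   * riscv boards); "numa_socket_boundary" is a placeholder flag name.
   */
  static void virt_machine_class_init(ObjectClass *oc, void *data)
  {
      MachineClass *mc = MACHINE_CLASS(oc);

      /* ... existing class init code ... */

      /*
       * Platform choice, not an architectural requirement: keep the CPUs
       * of one socket (or cluster) within one NUMA node, so that guests
       * relying on drivers/base/arch_topology.c (Linux on arm64/riscv)
       * can build consistent scheduling domains.
       */
      mc->numa_socket_boundary = true;
  }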

Thanks,
Gavin


Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Daniel Henrique Barboza 1 year, 2 months ago

On 2/23/23 05:13, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.
> 
>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \


And why is this a QEMU problem? This doesn't hurt ACPI.

Also, this restriction breaks ARM guests in the wild that are running
non-Linux OSes. I don't see why we should impact use cases that have nothing
to do with the Linux kernel's feelings about socket/NUMA-node exclusivity.


Thanks,


Daniel


>   
>     ------------[ cut here ]------------
>     WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
>     Modules linked in:
>     CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
>     pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>     pc : build_sched_domains+0x284/0x910
>     lr : build_sched_domains+0x184/0x910
>     sp : ffff80000804bd50
>     x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
>     x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
>     x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
>     x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
>     x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
>     x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
>     x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
>     x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
>     x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
>     x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
>     Call trace:
>      build_sched_domains+0x284/0x910
>      sched_init_domains+0xac/0xe0
>      sched_init_smp+0x48/0xc8
>      kernel_init_freeable+0x140/0x1ac
>      kernel_init+0x28/0x140
>      ret_from_fork+0x10/0x20
> 
> PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary
> PATCH[2] Validate the configuration if required in the NUMA subsystem
> PATCH[3] Enable the validation for aarch64 machines
> PATCH[4] Enable the validation for RiscV machines
> 
> v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html
> 
> Changelog
> =========
> v2:
>    * Fix socket-NUMA-node boundary issues in qtests/numa-test  (Gavin)
>    * Add helper set_numa_socket_boundary() and validate the
>      boundary in the generic path                              (Philippe)
> 
> Gavin Shan (4):
>    qtest/numa-test: Follow socket-NUMA-node boundary for aarch64
>    numa: Validate socket and NUMA node boundary if required
>    hw/arm: Validate socket and NUMA node boundary
>    hw/riscv: Validate socket and NUMA node boundary
> 
>   hw/arm/sbsa-ref.c       |  2 ++
>   hw/arm/virt.c           |  2 ++
>   hw/core/machine.c       | 34 ++++++++++++++++++++++++++++++++++
>   hw/core/numa.c          |  7 +++++++
>   hw/riscv/spike.c        |  1 +
>   hw/riscv/virt.c         |  1 +
>   include/sysemu/numa.h   |  4 ++++
>   tests/qtest/numa-test.c | 13 ++++++++++---
>   8 files changed, 61 insertions(+), 3 deletions(-)
>
Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
> On 2/23/23 05:13, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>>
>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> 
> 
> And why is this a QEMU problem? This doesn't hurt ACPI.
> 
> Also, this restriction impacts breaks ARM guests in the wild that are running
> non-Linux OSes. I don't see why we should impact use cases that has nothing to
> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> 

With the above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
are put into socket-1-cluster-0, meaning CPU-2 and CPU-3 belong to different sockets and
clusters. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
CPUs from different clusters and sockets have been associated with one NUMA node.

If I'm correct, the configuration isn't sensible in a baremetal environment, and the
same Linux kernel is supposed to work well on baremetal and virtualized machines.
So I think QEMU needs to emulate the topology as closely as we can to match the
baremetal environment. That's the reason why I think it's a QEMU problem even if it
doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
in another thread, we may need to guarantee that the CPUs in one cluster can't be
split across multiple NUMA nodes, which matches the baremetal environment, as I
understand it.

Right, the restriction to a socket-to-NUMA-node or cluster-to-NUMA-node boundary will
definitely break configurations running in the wild.

Thanks,
Gavin

[...]


Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Daniel Henrique Barboza 1 year, 2 months ago

On 2/24/23 04:09, Gavin Shan wrote:
> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>> On 2/23/23 05:13, Gavin Shan wrote:
>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>> used to populate the CPU topology in the Linux guest. It's required that
>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>> scheduling domain can't be sorted out, as the following warning message
>>> indicates. To avoid the unexpected confusion, this series attempts to
>>> rejects such kind of insane configurations.
>>>
>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>
>>
>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>
>> Also, this restriction impacts breaks ARM guests in the wild that are running
>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>
> 
> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> CPUs in different clusters and sockets have been associated with one NUMA node.
> 
> If I'm correct, the configuration isn't sensible in a baremetal environment and
> same Linux kernel is supposed to work well for baremetal and virtualized machine.
> So I think QEMU needs to emulate the topology as much as we can to match with the
> baremetal environment. It's the reason why I think it's a QEMU problem even it
> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> in another thread, we may need to gurantee that the CPUs in one cluster can't be
> split to multiple NUMA nodes, which matches with the baremetal environment, as I
> can understand.
> 
> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> definitely break the configurations running in the wild.


What about a warning? If the user attempts to use an exotic NUMA configuration
like the one you mentioned we can print something like:

"Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"

This would inform the user what might be going wrong in case Linux crashes or errors
out on them, and then the user is free to fix their topology (or the kernel). And
at the same time we wouldn't break existing stuff that happens to be working.


Thanks,


Daniel


> 
> Thanks,
> Gavin
> 
> [...]
> 

Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> On 2/24/23 04:09, Gavin Shan wrote:
>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>>> On 2/23/23 05:13, Gavin Shan wrote:
>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>>> used to populate the CPU topology in the Linux guest. It's required that
>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>>> scheduling domain can't be sorted out, as the following warning message
>>>> indicates. To avoid the unexpected confusion, this series attempts to
>>>> rejects such kind of insane configurations.
>>>>
>>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>>
>>>
>>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>>
>>> Also, this restriction impacts breaks ARM guests in the wild that are running
>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>>
>>
>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
>> CPUs in different clusters and sockets have been associated with one NUMA node.
>>
>> If I'm correct, the configuration isn't sensible in a baremetal environment and
>> same Linux kernel is supposed to work well for baremetal and virtualized machine.
>> So I think QEMU needs to emulate the topology as much as we can to match with the
>> baremetal environment. It's the reason why I think it's a QEMU problem even it
>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
>> in another thread, we may need to gurantee that the CPUs in one cluster can't be
>> split to multiple NUMA nodes, which matches with the baremetal environment, as I
>> can understand.
>>
>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
>> definitely break the configurations running in the wild.
> 
> 
> What about a warning? If the user attempts to use an exotic NUMA configuration
> like the one you mentioned we can print something like:
> 
> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> 
> This would inform the user what might be going wrong in case Linux is crashing/error
> out on them and then the user is free to fix their topology (or the kernel). And
> at the same time we wouldn't break existing stuff that happens to be working.
> 
> 

Yes, I think a warning message is more appropriate, so that users can fix their
irregular configurations while existing configurations aren't broken.
It would be nice to get agreement from Daniel P. Berrangé and Drew before
I change the code and post the next revision.
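
Something like this is what I have in mind for the warning-only variant,
keyed on the cluster boundary as discussed (the helper name is an assumption;
warn_report() and the possible_cpus fields are existing interfaces):

  #include "qemu/osdep.h"
  #include "qemu/error-report.h"
  #include "hw/boards.h"

  /* Warn once if the CPUs of one cluster span multiple NUMA nodes. */
  static void warn_cluster_numa_boundary(const MachineState *ms)
  {
      const CPUArchIdList *cpus = ms->possible_cpus;
      int i, j;

      for (i = 0; i < cpus->len; i++) {
          for (j = i + 1; j < cpus->len; j++) {
              const CpuInstanceProperties *a = &cpus->cpus[i].props;
              const CpuInstanceProperties *b = &cpus->cpus[j].props;

              if (a->has_node_id && b->has_node_id &&
                  a->has_socket_id && b->has_socket_id &&
                  a->has_cluster_id && b->has_cluster_id &&
                  a->socket_id == b->socket_id &&
                  a->cluster_id == b->cluster_id &&
                  a->node_id != b->node_id) {
                  warn_report("CPUs of one cluster span multiple NUMA nodes; "
                              "OSes like Linux may misbehave with this "
                              "topology");
                  return;
              }
          }
      }
  }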

Thanks,
Gavin



Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Igor Mammedov 1 year, 2 months ago
On Fri, 24 Feb 2023 21:16:39 +1100
Gavin Shan <gshan@redhat.com> wrote:

> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> > On 2/24/23 04:09, Gavin Shan wrote:  
> >> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:  
> >>> On 2/23/23 05:13, Gavin Shan wrote:  
> >>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> >>>> used to populate the CPU topology in the Linux guest. It's required that
> >>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> >>>> scheduling domain can't be sorted out, as the following warning message
> >>>> indicates. To avoid the unexpected confusion, this series attempts to
> >>>> rejects such kind of insane configurations.
> >>>>
> >>>>     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
> >>>>     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
> >>>>     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
> >>>>     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \  
> >>>
> >>>
> >>> And why is this a QEMU problem? This doesn't hurt ACPI.
> >>>
> >>> Also, this restriction impacts breaks ARM guests in the wild that are running
> >>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
> >>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> >>>  
> >>
> >> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> >> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> >> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> >> CPUs in different clusters and sockets have been associated with one NUMA node.
> >>
> >> If I'm correct, the configuration isn't sensible in a baremetal environment and
> >> same Linux kernel is supposed to work well for baremetal and virtualized machine.
> >> So I think QEMU needs to emulate the topology as much as we can to match with the
> >> baremetal environment. It's the reason why I think it's a QEMU problem even it
> >> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> >> in another thread, we may need to gurantee that the CPUs in one cluster can't be
> >> split to multiple NUMA nodes, which matches with the baremetal environment, as I
> >> can understand.
> >>
> >> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> >> definitely break the configurations running in the wild.  
> > 
> > 
> > What about a warning? If the user attempts to use an exotic NUMA configuration
> > like the one you mentioned we can print something like:
> > 
> > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> > 
> > This would inform the user what might be going wrong in case Linux is crashing/error
> > out on them and then the user is free to fix their topology (or the kernel). And
> > at the same time we wouldn't break existing stuff that happens to be working.
> > 
> >   
> 
> Yes, I think a warning message is more appropriate, so that users can fix their
> irregular configurations and the existing configurations aren't disconnected.
> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
> I'm going to change the code and post next revision.

Honestly you (and libvirt as far as I recall) are using legacy options
to assign cpus to numa nodes.
With '-numa node,nodeid=0,cpus=0-1' you can't really be sure where in the
topology those cpus are.
What you can do is use the '-numa cpu,...' option to assign a socket/core/...
to a numa node, ex:
        "-numa cpu,node-id=1,socket-id=0 "           or                              
        "-numa cpu,node-id=0,socket-id=1,core-id=0 "       or                      
        "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0
to get your desired mapping.

The problem that has so far been stopping the latter's adoption by libvirt (Michal CCed)
is that the values it uses are machine specific; to do it properly, for a concrete
'-M x -smp ...', QEMU should at least the first time be started with the
-preconfig option, and then the user should query the possible cpus for those values
and assign them to numa nodes via QMP.

btw: on x86 we also allow 'insane' configurations, incl. those that do not
exist on baremetal, and do not warn anyone about it (i.e. it's the user's
responsibility to provide a topology that the specific guest OS can handle,
aka it's not QEMU's business but the upper layers'). (I do occasionally try
to introduce stricter checks in that area, but that gets pushback more
often than not.)

I'd do the check only in the case of a specific board where the mapping is
fixed in the specs of the emulated machine; otherwise I wouldn't bother.

> Thanks,
> Gavin
> 
> 
Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Gavin Shan 1 year, 2 months ago
On 2/25/23 1:20 AM, Igor Mammedov wrote:
> On Fri, 24 Feb 2023 21:16:39 +1100
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
>>> On 2/24/23 04:09, Gavin Shan wrote:
>>>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>>>>> On 2/23/23 05:13, Gavin Shan wrote:
>>>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>>>>> used to populate the CPU topology in the Linux guest. It's required that
>>>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>>>>> scheduling domain can't be sorted out, as the following warning message
>>>>>> indicates. To avoid the unexpected confusion, this series attempts to
>>>>>> rejects such kind of insane configurations.
>>>>>>
>>>>>>      -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>>>>      -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
>>>>>>      -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
>>>>>>      -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
>>>>>
>>>>>
>>>>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>>>>
>>>>> Also, this restriction impacts breaks ARM guests in the wild that are running
>>>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>>>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>>>>   
>>>>
>>>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
>>>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
>>>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
>>>> CPUs in different clusters and sockets have been associated with one NUMA node.
>>>>
>>>> If I'm correct, the configuration isn't sensible in a baremetal environment and
>>>> same Linux kernel is supposed to work well for baremetal and virtualized machine.
>>>> So I think QEMU needs to emulate the topology as much as we can to match with the
>>>> baremetal environment. It's the reason why I think it's a QEMU problem even it
>>>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
>>>> in another thread, we may need to gurantee that the CPUs in one cluster can't be
>>>> split to multiple NUMA nodes, which matches with the baremetal environment, as I
>>>> can understand.
>>>>
>>>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
>>>> definitely break the configurations running in the wild.
>>>
>>>
>>> What about a warning? If the user attempts to use an exotic NUMA configuration
>>> like the one you mentioned we can print something like:
>>>
>>> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
>>>
>>> This would inform the user what might be going wrong in case Linux is crashing/error
>>> out on them and then the user is free to fix their topology (or the kernel). And
>>> at the same time we wouldn't break existing stuff that happens to be working.
>>>
>>>    
>>
>> Yes, I think a warning message is more appropriate, so that users can fix their
>> irregular configurations and the existing configurations aren't disconnected.
>> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
>> I'm going to change the code and post next revision.
> 
> Honestly you (and libvirt as far as I recall) are using legacy options
> to assign cpus to numa nodes.
> With '-numa node,nodeid=0,cpus=0-1' you can't really be sure what/where
> in topology those cpus are.
> What you can do is to use '-numa cpu,...' option to assign socket/core/...
> to numa node, ex:
>          "-numa cpu,node-id=1,socket-id=0 "           or
>          "-numa cpu,node-id=0,socket-id=1,core-id=0 "       or
>          "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0
> to get your desired mapping.
> 
> The problem that's so far was stopping the later adoption by libvirt (Michal CCed)
> is that values used by it are machine specific and to do it properly, for a concrete
> '-M x -smp ...' at least for the first time qemu should be started with-
>   -preconfig option and then user should query possible cpus for those values
> and assign them to numa nodes via QMP.
> 

The issue isn't related to the legacy or modern way of configuring the CPU-to-NUMA
association. Note that qtests/numa-test also uses the legacy way. For example, the
issue can be triggered with the following command lines, where the modern
configuration is used:

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64            \
   -accel kvm -machine virt,gic-version=host                          \
   -cpu host -smp 6,sockets=2,clusters=1,cores=3,threads=1            \
   -m 768M,slots=16,maxmem=64G                                        \
   -object memory-backend-ram,id=mem0,size=256M                       \
   -object memory-backend-ram,id=mem1,size=256M                       \
   -object memory-backend-ram,id=mem2,size=256M                       \
   -numa node,nodeid=0,memdev=mem0                                    \
   -numa node,nodeid=1,memdev=mem1                                    \
   -numa node,nodeid=2,memdev=mem2                                    \
   -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=0,thread-id=0 \
   -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=1,thread-id=0 \
   -numa cpu,node-id=1,socket-id=0,cluster-id=0,core-id=2,thread-id=0 \
   -numa cpu,node-id=1,socket-id=1,cluster-id=0,core-id=0,thread-id=0 \
   -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=1,thread-id=0 \
   -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=2,thread-id=0
     :
     :
   [    0.022260] =============== sched_init_domains ===================
   [    0.022263] CPU-0: 36,56,0,-1 thread=0  core=0-2  cluster=0-2 llc=0
   [    0.022267] CPU-1: 36,56,1,-1 thread=1  core=0-2  cluster=0-2 llc=1
   [    0.022269] CPU-2: 36,56,2,-1 thread=2  core=0-2  cluster=0-2 llc=2
   [    0.022270] CPU-3: 136,156,3,-1 thread=3  core=3-5  cluster=3-5 llc=3
   [    0.022272] CPU-4: 136,156,4,-1 thread=4  core=3-5  cluster=3-5 llc=4
   [    0.022273] CPU-5: 136,156,5,-1 thread=5  core=3-5  cluster=3-5 llc=5
   [    0.022275] CPU-0: SMT=0  CLUSTER=0  MULTICORE=0-2  CPUMASK=0-1  0-1
   [    0.022277] CPU-1: SMT=1  CLUSTER=1  MULTICORE=0-2  CPUMASK=0-1  0-1
   [    0.022279] CPU-2: SMT=2  CLUSTER=0-2  MULTICORE=2-3  CPUMASK=2-3  2-3
   [    0.022281] CPU-3: SMT=3  CLUSTER=3-5  MULTICORE=2-3  CPUMASK=2-3  2-3
   [    0.022283] CPU-4: SMT=4  CLUSTER=4  MULTICORE=3-5  CPUMASK=4-5  4-5
   [    0.022284] CPU-5: SMT=5  CLUSTER=5  MULTICORE=3-5  CPUMASK=4-5  4-5
   [    0.022314] build_sched_domains: CPU-0 level-SMT
   [    0.022317] build_sched_domains: CPU-0 level-CLS
   [    0.022318] topology_span_sane: CPU-0 0 vs CPU-2 0-2
   [    0.022322] ------------[ cut here ]------------
   [    0.022323] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2275 build_sched_domains+0x135c/0x1660
      :

> btw: on x86 we also allow 'insane' configuration incl. those that do not
> exist on baremetal and do not warn about it anyone (i.e. it's user's
> responsibility to provide topology that specific guest OS could handle,
> aka it's not QEMU business but upper layers). (I do occasionally try
> introduce stricter checks in that are, but that gets push back more
> often that not).
> 
> I'd do check only in case of a specific board where mapping is fixed
> in specs of emulated machine, otherwise I wouldn't bother.
> 

Right, x86 can handle the irregular configuration well and we didn't find
any triggered warning messages there. As the subject indicates, it's an
ARM64- or riscv-specific issue. Unfortunately, I failed to find the requirement for a
socket-to-NUMA-node or cluster-to-NUMA-node 1:1 mapping in the specs. However, it's not
sensible to split the CPUs in one cluster across multiple NUMA nodes in an (ARM64 or riscv)
baremetal environment. QEMU needs to emulate an environment close to the baremetal
machine if we can.

In summary, a warning message when the CPUs in one cluster are split across
multiple NUMA nodes, as Daniel suggested, sounds reasonable to me. Please let me
know your thoughts, Igor :)

Thanks,
Gavin


Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Posted by Andrew Jones 1 year, 2 months ago
On Fri, Feb 24, 2023 at 09:16:39PM +1100, Gavin Shan wrote:
> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
> > On 2/24/23 04:09, Gavin Shan wrote:
> > > On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
> > > > On 2/23/23 05:13, Gavin Shan wrote:
> > > > > For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> > > > > used to populate the CPU topology in the Linux guest. It's required that
> > > > > the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> > > > > scheduling domain can't be sorted out, as the following warning message
> > > > > indicates. To avoid the unexpected confusion, this series attempts to
> > > > > rejects such kind of insane configurations.
> > > > > 
> > > > >     -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
> > > > >     -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
> > > > >     -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
> > > > >     -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
> > > > 
> > > > 
> > > > And why is this a QEMU problem? This doesn't hurt ACPI.
> > > > 
> > > > Also, this restriction impacts breaks ARM guests in the wild that are running
> > > > non-Linux OSes. I don't see why we should impact use cases that has nothing to
> > > > do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
> > > > 
> > > 
> > > With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
> > > are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
> > > cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
> > > CPUs in different clusters and sockets have been associated with one NUMA node.
> > > 
> > > If I'm correct, the configuration isn't sensible in a baremetal environment and
> > > same Linux kernel is supposed to work well for baremetal and virtualized machine.
> > > So I think QEMU needs to emulate the topology as much as we can to match with the
> > > baremetal environment. It's the reason why I think it's a QEMU problem even it
> > > doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
> > > in another thread, we may need to gurantee that the CPUs in one cluster can't be
> > > split to multiple NUMA nodes, which matches with the baremetal environment, as I
> > > can understand.
> > > 
> > > Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
> > > definitely break the configurations running in the wild.
> > 
> > 
> > What about a warning? If the user attempts to use an exotic NUMA configuration
> > like the one you mentioned we can print something like:
> > 
> > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
> > 
> > This would inform the user what might be going wrong in case Linux is crashing/error
> > out on them and then the user is free to fix their topology (or the kernel). And
> > at the same time we wouldn't break existing stuff that happens to be working.
> > 
> > 
> 
> Yes, I think a warning message is more appropriate, so that users can fix their
> irregular configurations and the existing configurations aren't disconnected.
> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
> I'm going to change the code and post next revision.
>

If there's a concern that this will break non-Linux OSes on arm, then, at
most, the change needs to be tied to the next machine type version, and
possibly it can never be made for arm. riscv is OK, since it currently
ignores smp parameters anyway. It currently derives the number of sockets
from the number of NUMA nodes, using a 1:1 mapping. When smp parameters
are eventually implemented for riscv, then this can be revisited.

Also, it sounds like not only has the rationale for this series been
changed to "platform choice", but also that the cluster <-> numa node
mapping should be 1:1, not the socket <-> numa node mapping. If that's
the case, then the series probably needs to be reworked for that.

Thanks,
drew