[Qemu-devel] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node

Laurent Vivier posted 1 patch 4 years, 7 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20190830161345.22436-1-lvivier@redhat.com
Maintainers: David Gibson <david@gibson.dropbear.id.au>
hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
[Qemu-devel] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by Laurent Vivier 4 years, 7 months ago
When we hotplug a CPU on a memory-less/cpu-less node, the Linux kernel
crashes.

This happens because the Linux kernel needs to know the NUMA topology at
start to be able to initialize the distance lookup table.

On pseries, the topology is provided by the firmware via the existing
CPUs and memory information. Thus a node without memory and CPU cannot be
discovered by the kernel.

To avoid the kernel crash, do not allow pseries to start with empty
nodes.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index baedadf20b8c..8be738901cf9 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2847,6 +2847,39 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
+    /*
+     * check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (nb_numa_nodes) {
+        for (i = 0; i < nb_numa_nodes; ++i) {
+            /* check for memory-less node */
+            if (numa_info[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+                       "Memory-less/cpu-less nodes are not supported (node %d)",
+                                 i);
+                    exit(1);
+                }
+            }
+        }
+
+    }
+
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
-- 
2.21.0
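
For readers who want to experiment with the check outside of QEMU, here is a
minimal standalone sketch of the same logic in plain C. The names
`struct node_model` and `first_empty_node()` are illustrative assumptions, not
QEMU APIs; the patch itself walks `numa_info[]` and `CPU_FOREACH()` as shown in
the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of QEMU's numa_info[]. */
struct node_model {
    uint64_t node_mem;   /* bytes of memory assigned to the node at start */
};

/*
 * Return the index of the first node that has neither memory nor a
 * present CPU, or -1 if every node has at least one of the two.
 * cpu_node[] holds the node id of each CPU present at machine init.
 */
static int first_empty_node(const struct node_model *nodes, int nb_nodes,
                            const int *cpu_node, size_t nb_cpus)
{
    assert(nb_nodes >= 0);
    for (int i = 0; i < nb_nodes; i++) {
        if (nodes[i].node_mem != 0) {
            continue;               /* node has memory, nothing to check */
        }
        int found = 0;
        for (size_t c = 0; c < nb_cpus; c++) {
            if (cpu_node[c] == i) {
                found = 1;          /* memory-less, but it has a CPU */
                break;
            }
        }
        if (!found) {
            return i;               /* memory-less and cpu-less */
        }
    }
    return -1;
}
```

With two nodes where only node 0 has memory and the single boot CPU sits on
node 0, the function reports node 1, which is the case the patch refuses to
start with.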


Re: [Qemu-devel] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by David Gibson 4 years, 7 months ago
On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> crashes.
> 
> This happens because linux kernel needs to know the NUMA topology at
> start to be able to initialize the distance lookup table.
> 
> On pseries, the topology is provided by the firmware via the existing
> CPUs and memory information. Thus a node without memory and CPU cannot be
> discovered by the kernel.
> 
> To avoid the kernel crash, do not allow to start pseries with empty
> nodes.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Applied to ppc-for-4.2.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by Daniel P. Berrangé 4 years, 7 months ago
On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> crashes.
> 
> This happens because linux kernel needs to know the NUMA topology at
> start to be able to initialize the distance lookup table.
> 
> On pseries, the topology is provided by the firmware via the existing
> CPUs and memory information. Thus a node without memory and CPU cannot be
> discovered by the kernel.
> 
> To avoid the kernel crash, do not allow to start pseries with empty
> nodes.

This describes one possible guest OS. Is there any reasonable chance
that a non-Linux guest might be able to handle this situation correctly,
or do you expect any guest to have the same restriction ?

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by Greg Kurz 4 years, 7 months ago
On Fri, 30 Aug 2019 17:34:13 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > crashes.
> > 
> > This happens because linux kernel needs to know the NUMA topology at
> > start to be able to initialize the distance lookup table.
> > 
> > On pseries, the topology is provided by the firmware via the existing
> > CPUs and memory information. Thus a node without memory and CPU cannot be
> > discovered by the kernel.
> > 
> > To avoid the kernel crash, do not allow to start pseries with empty
> > nodes.
> 
> This describes one possible guest OS. Is there any reasonable chance
> that a non-Linux guest might be able to handle this situation correctly,
> or do you expect any guest to have the same restriction ?
> 

I can try to grab an AIX image and give it a try, but anyway this looks like
a very big hammer to me... :-\

> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> 
> Regards,
> Daniel


Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by David Gibson 4 years, 7 months ago
On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> On Fri, 30 Aug 2019 17:34:13 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > crashes.
> > > 
> > > This happens because linux kernel needs to know the NUMA topology at
> > > start to be able to initialize the distance lookup table.
> > > 
> > > On pseries, the topology is provided by the firmware via the existing
> > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > discovered by the kernel.
> > > 
> > > To avoid the kernel crash, do not allow to start pseries with empty
> > > nodes.
> > 
> > This describes one possible guest OS. Is there any reasonable chance
> > that a non-Linux guest might be able to handle this situation correctly,
> > or do you expect any guest to have the same restriction ?

That's... a more complicated question than you'd think.

The problem here is it's not really obvious in PAPR how topology
information for nodes without memory should be described in the device
tree (which is the only way we give that information to the guest).

It's possible there's some way to encode this information that would
make AIX happy and we just need to fix Linux to cope with that, but
it's not really clear what it would be.
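
A simplified model of why such a node can be invisible: if the guest learns
node ids only from the resources that are present in the device tree at boot,
a node that owns no resource never shows up. The sketch below assumes each CPU
or memory entry carries a single node id (on pseries this is derived from the
`ibm,associativity` property). `struct dt_resource`, `discover_nodes()` and
`MAX_NODES` are hypothetical names, and this is not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical: one CPU or memory entry seen in the device tree at boot. */
struct dt_resource {
    int node_id;
};

/* Mark every node id referenced by at least one boot-time resource. */
static void discover_nodes(const struct dt_resource *res, size_t n,
                           bool known[MAX_NODES])
{
    for (int i = 0; i < MAX_NODES; i++) {
        known[i] = false;
    }
    for (size_t r = 0; r < n; r++) {
        assert(res[r].node_id >= 0 && res[r].node_id < MAX_NODES);
        known[res[r].node_id] = true;
    }
}
```

Under this model, a boot with one CPU and one memory region both on node 0
leaves node 1 unknown, so a CPU hot-plugged into node 1 later refers to a node
the guest never set up.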

> I can try to grab an AIX image and give it a try, but anyway this looks like
> a very big hammer to me... :-\

I'm not really sure why everyone seems to think losing zero-memory
node capability is such a big deal.  It's never worked in practice on
POWER and we can always put it back if we figure out a sensible way to
do it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by Greg Kurz 4 years, 7 months ago
On Mon, 2 Sep 2019 16:27:18 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > > 
> > > > This happens because linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > > 
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > > 
> > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > nodes.
> > > 
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
> 
> That's... a more complicated question than you'd think.
> 
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
> 

The reported issue is about a node without memory AND without CPUs.

> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
> 
> > I can try to grab an AIX image and give it a try, but anyway this looks like
> > a very big hammer to me... :-\
> 
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal.  It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.
> 

It isn't really about losing the memory-less/cpu-less node capability, but
more about finding the appropriate fix. The changelog doesn't give many
clues about what exactly is happening: QEMU command line? Linux call stack?

For example, I could hit a crash with the following command line:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1

[   24.507552] Built 1 zonelists, mobility grouping on.  Total pages: 7656
[   24.507592] Policy zone: Normal
[   24.553481] WARNING: workqueue cpumask: online intersect > possible intersect
[   24.608814] BUG: Unable to handle kernel data access at 0x14e13da04c5bc37e
[   24.608875] Faulting instruction address: 0xc000000000175650
[   24.608931] Oops: Kernel access of bad area, sig: 11 [#1]
[   24.608976] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
[   24.609042] Modules linked in: virtio_net vmx_crypto net_failover failover crct10dif_vpmsum ip_tables xfs libcrc32c crc32c_vpmsum virtio_blk kvm rpadlpar_io rpaphp 9p fscache 9pnet_virtio 9pnet
[   24.609222] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.17-300.fc30.ppc64le #1
[   24.609286] NIP:  c000000000175650 LR: c000000000175310 CTR: 0000000000000000
[   24.609351] REGS: c00000001e597210 TRAP: 0380   Not tainted  (5.1.17-300.fc30.ppc64le)
[   24.609414] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44444248  XER: 00000000
[   24.609482] CFAR: c000000000175528 IRQMASK: 0 
[   24.609482] GPR00: c000000000175310 c00000001e5974a0 c0000000015fc400 0000000000000002 
[   24.609482] GPR04: 0000000000000001 0000000000000001 0000000000000001 0000000000000400 
[   24.609482] GPR08: 14e13da04c5bc37e 0000000000000000 0000000000000000 0000000000000000 
[   24.609482] GPR12: 0000000024022248 c00000000fffee00 0000000000000007 c00000001e0e8fb0 
[   24.609482] GPR16: c00000000162dc70 0000000000000008 c00000001e5976d8 0000000020000000 
[   24.609482] GPR20: 0000000100000003 0000000000000001 0000000000000000 14e13da04c5bc35e 
[   24.609482] GPR24: c000000001630164 0000000000000010 14e13da04c5bc37e 0000000000000000 
[   24.609482] GPR28: 0000000000000002 c0000000142a0e00 c00000001ff25d80 c00000001e5975a8 
[   24.610052] NIP [c000000000175650] find_busiest_group+0x510/0xe10
[   24.610107] LR [c000000000175310] find_busiest_group+0x1d0/0xe10
[   24.610169] Call Trace:
[   24.610203] [c00000001e5974a0] [c000000000175310] find_busiest_group+0x1d0/0xe10 (unreliable)
[   24.610304] [c00000001e597680] [c000000000176110] load_balance+0x1c0/0xe80
[   24.610377] [c00000001e5977d0] [c000000000176ff8] rebalance_domains+0x228/0x380
[   24.610467] [c00000001e597880] [c000000000c7c170] __do_softirq+0x170/0x404
[   24.610542] [c00000001e597980] [c000000000124368] irq_exit+0xd8/0x110
[   24.610617] [c00000001e5979a0] [c000000000028778] timer_interrupt+0x128/0x2e0
[   24.610706] [c00000001e597a00] [c000000000009314] decrementer_common+0x154/0x160
[   24.610799] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[   24.610799]     LR = check_and_cede_processor+0x48/0x60
[   24.610915] [c00000001e597d00] [c00000001e597d60] 0xc00000001e597d60 (unreliable)
[   24.611004] [c00000001e597d60] [c0000000009e22a8] shared_cede_loop+0x68/0x180
[   24.611096] [c00000001e597da0] [c0000000009dec64] cpuidle_enter_state+0xa4/0x660
[   24.611191] [c00000001e597e30] [c0000000001647a0] call_cpuidle+0x50/0xa0
[   24.611270] [c00000001e597e50] [c000000000164d6c] do_idle+0x2cc/0x3b0
[   24.611350] [c00000001e597ec0] [c00000000016508c] cpu_startup_entry+0x3c/0x50
[   24.611445] [c00000001e597ef0] [c000000000051dd0] start_secondary+0x630/0x660
[   24.611539] [c00000001e597f90] [c00000000000b25c] start_secondary_prolog+0x10/0x14
[   24.611632] Instruction dump:
[   24.611680] 7c374800 41820234 e8920016 3b570020 8152002c 7c893670 7d290194 548506be 
[   24.611775] 788606a0 7d2907b4 79291f24 7d1a4a14 <7cfa482a> 7ce72c36 78e707e0 2d270000 
[   24.611871] ---[ end trace 0e5e3ed14d31f59d ]---
[   24.617852] 
[   25.617885] Kernel panic - not syncing: Aiee, killing interrupt handler!

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB

but the crash doesn't occur with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1 \
-device spapr-pci-host-bridge,index=1,id=phb1,numa_node=1

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1

[  154.637304] Policy zone: Normal
[  154.665463] WARNING: workqueue cpumask: online intersect > possible intersect

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB


nor with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0,cpus=0 \
-numa node,nodeid=1

qemu-system-ppc64: warning: CPU(s) not present in any NUMA nodes: CPU 1 [core-id: 1]
qemu-system-ppc64: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
(qemu) device_add host-spapr-cpu-core,core-id=1
(qemu) info numa 
2 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB

So I don't know why Linux crashes, but it isn't exactly because of having
a cpu-less/memory-less node, and this patch rejects the non-crashing cases
as well.
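
To make the comparison concrete, the three runs can be written down as data and
fed to the same predicate the patch uses. This is only a sketch: `struct run`
and `patch_rejects()` are hypothetical names, the boot-time values are read off
the `info numa` output before `device_add`, and the `crashed` field records
what was observed, not something the code predicts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Boot-time state of node 1 in each run, plus the observed outcome. */
struct run {
    const char *desc;
    unsigned node1_mem_mb;    /* "node 1 size" before hotplug */
    unsigned node1_cpus;      /* CPUs listed for node 1 before hotplug */
    bool node1_has_phb;       /* spapr-pci-host-bridge with numa_node=1 */
    bool crashed;             /* guest oops after device_add (observed) */
};

static const struct run runs[] = {
    { "empty node 1",                   0, 0, false, true  },
    { "empty node 1 plus PHB",          0, 0, true,  false },
    { "cpus=0 given for node 0 only",   0, 0, false, false },
};

/* The patch's predicate: no memory and no present CPU. PHBs are ignored. */
static bool patch_rejects(const struct run *r)
{
    assert(r != NULL);
    return r->node1_mem_mb == 0 && r->node1_cpus == 0;
}
```

All three runs are rejected by the predicate, although only the first one
crashed.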
Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by Daniel P. Berrangé 4 years, 7 months ago
On Mon, Sep 02, 2019 at 04:27:18PM +1000, David Gibson wrote:
> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > > 
> > > > This happens because linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > > 
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > > 
> > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > nodes.
> > > 
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
> 
> That's... a more complicated question than you'd think.
> 
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
> 
> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
> 
> > I can try to grab an AIX image and give it a try, but anyway this looks like
> > a very big hammer to me... :-\
> 
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal.  It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.

I'm not that bothered - I just wanted to double check that we were not
intentionally breaking a non-Linux guest OS that was known to work today.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Posted by David Gibson 4 years, 7 months ago
On Mon, Sep 02, 2019 at 09:57:36AM +0100, Daniel P. Berrangé wrote:
> On Mon, Sep 02, 2019 at 04:27:18PM +1000, David Gibson wrote:
> > On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > > On Fri, 30 Aug 2019 17:34:13 +0100
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > 
> > > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > > crashes.
> > > > > 
> > > > > This happens because linux kernel needs to know the NUMA topology at
> > > > > start to be able to initialize the distance lookup table.
> > > > > 
> > > > > On pseries, the topology is provided by the firmware via the existing
> > > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > > discovered by the kernel.
> > > > > 
> > > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > > nodes.
> > > > 
> > > > This describes one possible guest OS. Is there any reasonable chance
> > > > that a non-Linux guest might be able to handle this situation correctly,
> > > > or do you expect any guest to have the same restriction ?
> > 
> > That's... a more complicated question than you'd think.
> > 
> > The problem here is it's not really obvious in PAPR how topology
> > information for nodes without memory should be described in the device
> > tree (which is the only way we give that information to the guest).
> > 
> > It's possible there's some way to encode this information that would
> > make AIX happy and we just need to fix Linux to cope with that, but
> > it's not really clear what it would be.
> > 
> > > I can try to grab an AIX image and give it a try, but anyway this looks like
> > > a very big hammer to me... :-\
> > 
> > I'm not really sure why everyone seems to think losing zero-memory
> > node capability is such a big deal.  It's never worked in practice on
> > POWER and we can always put it back if we figure out a sensible way to
> > do it.
> 
> I'm not that bothered - I just wanted to double check that we were not
> intentionally breaking a non-Linux guest OS that was known to work today.

There are no non-Linux guests that are known to work today, unless you
count the kvm-unit-tests micro-OS.  AIX support is coming along, but
it's by no means established.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson