[Qemu-devel] [PATCH] hw/ppc: disable hotplug before CAS is completed

Posted by Daniel Henrique Barboza 6 years, 7 months ago
This patch is a follow-up to the discussions that started with
Laurent's patch series "spapr: disable hotplugging without OS" [1]
and continued in the patch "spapr: reset DRCs on migration
pre_load" [2].

At this moment we do not support CPU/memory hotplug in the early
boot stages, before CAS. The reason is that the hotplug event can't
be handled at the SLOF level (or even in the PRELAUNCH runstate) and
at the same time can't be canceled. This leaves devices unable to be
hot unplugged and, in some cases, causes guest kernel oopses. After
CAS, with the FDT in place, the guest can handle the hotplug events
and everything works as usual.

An attempt was made to support hotplug before CAS, without success.
The key difference in the current code flow between a coldplugged
and a hotplugged device, in the PRELAUNCH state, is that the
coldplugged device is registered in the base FDT, allowing its DRC
to go straight to the CONFIGURED state. In theory this could also be
done for a hotplugged device, if we could add it to the base of the
existing FDT. However, tampering with the FDT after it has been
written to guest memory, besides being a dubious idea, is simply not
possible: the FDT is written in ppc_spapr_reset and there is no way
to retrieve it afterwards - we can calculate fdt_address, but
fdt_size isn't stored. Storing fdt_size to allow later retrieval
would be yet another piece of state that needs to be migrated. In
short, it is not worth the trouble.

All this said, this patch opts to disable CPU/memory hotplug in the
early boot stages. CAS completion is detected by checking whether
any bits are set in ov5_cas, which avoids adding an extra state that
would need tracking/migration. The patch also makes sure it doesn't
interfere with hotplug in the INMIGRATE runstate.
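
As a side note on why the check works: spapr_ovec_is_unset() below
relies on the bitops convention that find_last_bit() returns the
bitmap size when no bit is set at all. A minimal standalone
illustration of that convention (toy re-implementation, not the
actual QEMU/Linux helper):

    #include <stdio.h>
    #include <limits.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
    #define OV_MAXBITS 64UL   /* stand-in for the option vector width */

    /* Toy find_last_bit(): returns the index of the highest set bit,
     * or 'size' when no bit is set at all - the same convention the
     * real bitops helper follows. */
    static unsigned long find_last_bit(const unsigned long *bitmap,
                                       unsigned long size)
    {
        unsigned long i;

        for (i = size; i-- > 0; ) {
            if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))) {
                return i;
            }
        }
        return size;
    }

    int main(void)
    {
        unsigned long before_cas[2] = { 0, 0 };        /* no OV5 bit set */
        unsigned long after_cas[2]  = { 1UL << 4, 0 }; /* some bit set */

        /* Mirrors spapr_ovec_is_unset(): empty bitmap <=> CAS not done */
        printf("before CAS, is_unset = %d\n",
               find_last_bit(before_cas, OV_MAXBITS) == OV_MAXBITS);
        printf("after CAS,  is_unset = %d\n",
               find_last_bit(after_cas, OV_MAXBITS) == OV_MAXBITS);
        return 0;
    }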

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05226.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg01989.html

Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
---
 hw/ppc/spapr.c              | 26 ++++++++++++++++++++++++++
 hw/ppc/spapr_ovec.c         |  7 +++++++
 include/hw/ppc/spapr_ovec.h |  1 +
 3 files changed, 34 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f7a1972..bdcc813 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2803,6 +2803,19 @@ out:
     error_propagate(errp, local_err);
 }
 
+/*
+ * 'h_client_architecture_support' will set at least OV5_FORM1_AFFINITY
+ * in ov5_cas when intersecting it with spapr->ov5 and ov5_guest. It's safe
+ * then to assume that ov5_cas will have something set after CAS.
+ */
+static bool spapr_cas_completed(sPAPRMachineState *spapr)
+{
+    if (spapr->ov5_cas == NULL) {
+        return false;
+    }
+    return !spapr_ovec_is_unset(spapr->ov5_cas);
+}
+
 static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
                                   Error **errp)
 {
@@ -3256,6 +3269,19 @@ static void spapr_machine_device_unplug_request(HotplugHandler *hotplug_dev,
 static void spapr_machine_device_pre_plug(HotplugHandler *hotplug_dev,
                                           DeviceState *dev, Error **errp)
 {
+    sPAPRMachineState *spapr;
+    Error *local_err = NULL;
+
+    if (dev->hotplugged && !runstate_check(RUN_STATE_INMIGRATE)) {
+        spapr = SPAPR_MACHINE(hotplug_dev);
+        if (!spapr_cas_completed(spapr)) {
+            error_setg(&local_err,
+                       "CPU/memory hotplug not supported at early boot");
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
         spapr_memory_pre_plug(hotplug_dev, dev, errp);
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_SPAPR_CPU_CORE)) {
diff --git a/hw/ppc/spapr_ovec.c b/hw/ppc/spapr_ovec.c
index 41df4c3..fe7bc85 100644
--- a/hw/ppc/spapr_ovec.c
+++ b/hw/ppc/spapr_ovec.c
@@ -134,6 +134,13 @@ bool spapr_ovec_test(sPAPROptionVector *ov, long bitnr)
     return test_bit(bitnr, ov->bitmap) ? true : false;
 }
 
+bool spapr_ovec_is_unset(sPAPROptionVector *ov)
+{
+    unsigned long lastbit;
+    lastbit = find_last_bit(ov->bitmap, OV_MAXBITS);
+    return (lastbit == OV_MAXBITS);
+}
+
 static void guest_byte_to_bitmap(uint8_t entry, unsigned long *bitmap,
                                  long bitmap_offset)
 {
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 9edfa5f..8126374 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -71,6 +71,7 @@ void spapr_ovec_cleanup(sPAPROptionVector *ov);
 void spapr_ovec_set(sPAPROptionVector *ov, long bitnr);
 void spapr_ovec_clear(sPAPROptionVector *ov, long bitnr);
 bool spapr_ovec_test(sPAPROptionVector *ov, long bitnr);
+bool spapr_ovec_is_unset(sPAPROptionVector *ov);
 sPAPROptionVector *spapr_ovec_parse_vector(target_ulong table_addr, int vector);
 int spapr_ovec_populate_dt(void *fdt, int fdt_offset,
                            sPAPROptionVector *ov, const char *name);
-- 
2.9.4


Re: [Qemu-devel] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by David Gibson 6 years, 7 months ago
On Tue, Aug 15, 2017 at 05:28:46PM -0300, Daniel Henrique Barboza wrote:
> This patch is a follow up on the discussions that started with
> Laurent's patch series "spapr: disable hotplugging without OS" [1]
> and discussions made at patch "spapr: reset DRCs on migration
> pre_load" [2].
> 
> [...]
> Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>

I don't think this is a good idea.

1) After my DRC cleanups, early hotplug works just fine for me.  I'm
not sure why it isn't for you: we need to understand that before
proceeding.

2) libvirt actually uses early hotplug fairly often (before even
starting the firmware).  At the moment this works - at least in some
cases (see above), though there are some wrinkles to work out.  This
will break it completely and require an entirely different approach to
fix again.

3) There's no fundamental reason early hotplug shouldn't work - the
event will just be queued until the OS boots and processes it.

I know I suggested disabling early hotplug earlier, but that was
before I'd dug into the DRC layer and properly understood what was
going on here.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by Daniel Henrique Barboza 6 years, 7 months ago

On 08/17/2017 04:52 AM, David Gibson wrote:
> On Tue, Aug 15, 2017 at 05:28:46PM -0300, Daniel Henrique Barboza wrote:
>> [...]
> I don't think this is a good idea.
>
> 1) After my DRC cleanups, early hotplug works just fine for me.  I'm
> not sure why it isn't for you: we need to understand that before
> proceeding.
>
> 2) libvirt actually uses early hotplug fairly often (before even
> starting the firmware).  At the moment this works - at least in some
> cases (see above), though there are some wrinkles to work out.  This
> will break it completely and require an entirely different approach to
> fix again.

Now that you mention it, I remember having this same discussion with
you about the same topic. Back then we decided to leave it alone,
since you couldn't reproduce the behavior but I could.

I can still reproduce this bug and ended up investigating a bit more
today:


- one difference in QEMU between hotplugging before and after CAS is here:

hw/ppc/spapr_events.c - rtas_event_log_to_source

    switch (log_type) {
    case RTAS_LOG_TYPE_HOTPLUG:
        source = spapr_event_sources_get_source(spapr->event_sources,
                                                EVENT_CLASS_HOT_PLUG);
        if (spapr_ovec_test(spapr->ov5_cas, OV5_HP_EVT)) {
            g_assert(source->enabled);
            break;
        }
        /* fall back to epow for legacy hotplug interrupt source */
    case RTAS_LOG_TYPE_EPOW:
        source = spapr_event_sources_get_source(spapr->event_sources,
                                                EVENT_CLASS_EPOW);
        break;
    default:


Note the ovec_test for OV5_HP_EVT. When hotplugging a CPU in early
boot, ov5_cas doesn't have anything set yet, making this check fail,
and due to the position of the 'break' there (which I believe is
intended), it falls back to logging the event as EPOW instead of
HOT_PLUG.

I tried hacking this code by adding another break to ensure that the
event got logged as HOT_PLUG (as happens post-CAS), but then I got a
kernel panic at boot. So I am not sure whether this code needs any
change or further thought.
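
To make the break/fallthrough behavior explicit, here is a small
standalone model of that dispatch (hypothetical names, only mirroring
the quoted switch; compiles with any C compiler):

    #include <stdio.h>
    #include <stdbool.h>

    enum log_type { LOG_HOTPLUG, LOG_EPOW };
    enum event_class { CLASS_HOT_PLUG, CLASS_EPOW };

    /* Models rtas_event_log_to_source(): before CAS, OV5_HP_EVT is
     * not negotiated, so a HOTPLUG log falls through to the EPOW
     * source. */
    static enum event_class route(enum log_type type, bool hp_evt_negotiated)
    {
        enum event_class source = CLASS_EPOW;

        switch (type) {
        case LOG_HOTPLUG:
            source = CLASS_HOT_PLUG;
            if (hp_evt_negotiated) {
                break;      /* post-CAS: keep the dedicated source */
            }
            /* fall through: legacy hotplug interrupt source */
        case LOG_EPOW:
            source = CLASS_EPOW;
            break;
        }
        return source;
    }

    int main(void)
    {
        printf("pre-CAS hotplug  -> %s\n",
               route(LOG_HOTPLUG, false) == CLASS_EPOW ? "EPOW" : "HOT_PLUG");
        printf("post-CAS hotplug -> %s\n",
               route(LOG_HOTPLUG, true) == CLASS_EPOW ? "EPOW" : "HOT_PLUG");
        return 0;
    }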

- hotplugging a CPU at an early stage gives me a warning message in SLOF:

--------------

Calling ibm,client-architecture-support...Node not supported
Node not supported
  not implemented
memory layout at init:
---------------

The code that prints the 'Node not supported' message is related to
the fdt-create-cas-node function of board-qemu/slof/fdt.fs. The code
is looking for either "memory@" or "ibm,dynamic-reconfiguration-memory"
nodes, and gives this error when it finds a CPU node.


- if I hotplug another CPU after the guest completes the boot, the
previously added CPU suddenly comes online too:

[ started VM with -S ]

(qemu) device_add host-spapr-cpu-core,id=core1,core-id=1
(qemu) cont

[ guest finishes boot ]

danielhb@ubuntu1710:~$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              1
On-line CPU(s) list: 0
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.1 (pvr 004b 0201)
Model name:          POWER8E (raw), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
NUMA node0 CPU(s):   0
danielhb@ubuntu1710:~$
danielhb@ubuntu1710:~$
danielhb@ubuntu1710:~$ (qemu)
(qemu) info cpus
* CPU #0: nip=0xc0000000000a464c thread_id=131946
   CPU #1: nip=0x0000000000000000 (halted) thread_id=131954
(qemu)
(qemu) device_add host-spapr-cpu-core,id=core2,core-id=2
(qemu)
(qemu) info cpus
* CPU #0: nip=0xc0000000000a464c thread_id=131946
   CPU #1: nip=0xc0000000000a464c thread_id=131954
   CPU #2: nip=0xc0000000000a464c thread_id=132144
(qemu)

danielhb@ubuntu1710:~$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              3
On-line CPU(s) list: 0-2
Thread(s) per core:  1
Core(s) per socket:  3
Socket(s):           1
NUMA node(s):        1
Model:               2.1 (pvr 004b 0201)
Model name:          POWER8E (raw), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
NUMA node0 CPU(s):   0-2
danielhb@ubuntu1710:~$


This makes me believe that the issue is that the guest isn't aware
of the CPU's presence, making me wonder whether this has something
to do with the qemu_irq_pulse at the end of spapr_hotplug_req_event
being lost. On the second hotplug, we re-assert the IRQ at the end
of check-exception and the guest is made aware of the queued hotplug
event that was ignored at first.
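
A toy model of that theory (purely illustrative, none of this is QEMU
code): the event stays queued, but the edge-triggered pulse is lost
if the guest isn't listening yet, and only a later pulse unjams the
queue:

    #include <stdio.h>
    #include <stdbool.h>

    static int queue[8];
    static int queued;
    static bool guest_listening;

    /* Models spapr_hotplug_req_event(): queue the log entry, then
     * pulse the interrupt. The pulse is edge-triggered - if the
     * guest isn't up yet it is simply lost, but the entry stays
     * queued. */
    static void hotplug_req(int id)
    {
        queue[queued++] = id;
        if (guest_listening) {      /* the qemu_irq_pulse analogue */
            /* guest runs check-exception and drains everything */
            while (queued) {
                printf("guest processed event %d\n", queue[--queued]);
            }
        } else {
            printf("event %d queued, pulse lost (guest not booted)\n", id);
        }
    }

    int main(void)
    {
        hotplug_req(1);          /* early hotplug: pulse is lost      */
        guest_listening = true;  /* guest finishes booting            */
        hotplug_req(2);          /* second hotplug unjams both events */
        return 0;
    }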



Daniel


Re: [Qemu-devel] [Qemu-ppc] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by David Gibson 6 years, 7 months ago
On Thu, Aug 17, 2017 at 06:31:28PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 08/17/2017 04:52 AM, David Gibson wrote:
> > On Tue, Aug 15, 2017 at 05:28:46PM -0300, Daniel Henrique Barboza wrote:
> > > [...]
> > 1) After my DRC cleanups, early hotplug works just fine for me.  I'm
> > not sure why it isn't for you: we need to understand that before
> > proceeding.
> > 
> > 2) libvirt actually uses early hotplug fairly often (before even
> > starting the firmware).  At the moment this works - at least in some
> > cases (see above), though there are some wrinkles to work out.  This
> > will break it completely and require an entirely different approach to
> > fix again.
> 
> Now that you mentioned I remember having this same discussion with you,
> about the same topic. Back then we decided to leave it alone, since you
> couldn't
> reproduce the behavior but I could.
> 
> I still can reproduce this bug and ended up investigating a bit more today:
> 
> 
> - one difference in QEMU between hotplugging before and after CAS is here:
> 
> hw/ppc/spapr_events.c - rtas_event_log_to_source
> 
>     switch (log_type) {
>     case RTAS_LOG_TYPE_HOTPLUG:
>         source = spapr_event_sources_get_source(spapr->event_sources,
>                                                 EVENT_CLASS_HOT_PLUG);
>         if (spapr_ovec_test(spapr->ov5_cas, OV5_HP_EVT)) {
>             g_assert(source->enabled);
>             break;
>         }
>         /* fall back to epow for legacy hotplug interrupt source */
>     case RTAS_LOG_TYPE_EPOW:
>         source = spapr_event_sources_get_source(spapr->event_sources,
>                                                 EVENT_CLASS_EPOW);
>         break;
>     default:
> 
> 
> Note the ovec_test for OV5_HP_EVT. When hotplugging a CPU in early
> boot, ov5_cas doesn't have anything set yet, making this check fail,
> and due to the position of the 'break' there (which I believe is
> intended), it falls back to logging the event as EPOW instead of
> HOT_PLUG.

Ah.. I'm not sure if this is the cause of the problem you're seeing,
but it's a good point regardless.  We're queuing an event before we
know how to queue events, which is definitely wrong.

Urgh.. how to fix this.  Banning hotplug until after CAS seems
logically sound, but will break the libvirt use case.  So we could..

1) Queue events internally, but only deliver them to the right actual
RTAS queue after CAS.  Blech.

2) At CAS time, reset all the DRCs, and emit extra DT information for
anything hotplugged since init.  I've suggested this before, but
dropped it when early hotplug seemed to be working for me.

3) Work around in libvirt by having it issue a system_reset after the
hotplugs.

In the short to medium term at least, I think (2) is probably our
best option.
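
As a rough standalone model of what (2) would mean (hypothetical
names, not an actual patch): at CAS time an early-hotplugged device
gets its DRC reset onto the coldplug path and its DT node emitted
with the CAS response, making it indistinguishable from a coldplugged
one:

    #include <stdio.h>

    enum drc_state { DRC_UNCONFIGURED, DRC_CONFIGURED };

    struct drc {
        const char *dev;
        enum drc_state state;
        int in_dt;          /* node present in the DT the guest sees */
    };

    /* Option (2) sketch: fold early-hotplugged devices back into the
     * coldplug path at CAS. */
    static void cas_fixup(struct drc *d)
    {
        if (d->state == DRC_UNCONFIGURED && !d->in_dt) {
            d->state = DRC_CONFIGURED;  /* reset straight to configured */
            d->in_dt = 1;               /* emit node in the CAS response */
            printf("%s: DT node emitted at CAS, DRC configured\n", d->dev);
        }
    }

    int main(void)
    {
        struct drc cold = { "coldplugged cpu", DRC_CONFIGURED, 1 };
        struct drc hot  = { "early-hotplugged cpu", DRC_UNCONFIGURED, 0 };

        cas_fixup(&cold);   /* nothing to do */
        cas_fixup(&hot);    /* now looks coldplugged */
        printf("states match: %d\n",
               cold.state == hot.state && cold.in_dt == hot.in_dt);
        return 0;
    }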


> [...]
> 
> - if I hotplug another CPU after the guest completes the boot, the
> previously added CPU suddenly comes online too:

Right, the interrupt from the new event is unjamming the queue, so the
guest kernel / drmgr is then able to add the earlier cpu as well.

> [...]
> 
> This makes me believe that the issue is that the guest isn't aware
> of the CPU's presence, making me wonder whether this has something
> to do with the qemu_irq_pulse at the end of spapr_hotplug_req_event
> being lost. On the second hotplug, we re-assert the IRQ at the end
> of check-exception and the guest is made aware of the queued hotplug
> event that was ignored at first.

Yeah, I think that's it.



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by Daniel Henrique Barboza 6 years, 7 months ago

On 08/18/2017 01:14 AM, David Gibson wrote:
> On Thu, Aug 17, 2017 at 06:31:28PM -0300, Daniel Henrique Barboza wrote:
>>
>> On 08/17/2017 04:52 AM, David Gibson wrote:
>>> On Tue, Aug 15, 2017 at 05:28:46PM -0300, Daniel Henrique Barboza wrote:
>>>> [...]
>
> Urgh.. how to fix this.  Banning hotplug until after CAS seems
> logically sound, but will break the libvirt use case.  So we could..
>
> 1) Queue events internally, but only deliver them to the right actual
> RTAS queue after CAS.  Blech.
>
> 2) At CAS time, reset all the DRCs, and emit extra DT information for
> anything hotplugged since init.  I've suggested this before, but
> dropped it when early hotplug seemed to be working for me.

Update: I've been trying to implement solution (2) here by adding the
extra DT info for the hotplugged CPU while resetting its DRC. The
problem I am facing is that the kernel panics at parse_numa_properties
when trying to find the node id of the added CPU. This is intriguing
because the extra DT info is enough to make the kernel start the
added CPU at smp_release_cpus().

The DT information is being added in the CAS response, in
spapr_fixup_cpu_dt, and it's the same information that is added for
the coldplugged CPUs. I have a theory that the guest (SLOF?) isn't
prepared to handle a new CPU being added at this point, which might
be leading to the kernel bug I am seeing. I would appreciate it if
anyone with good CAS/SLOF knowledge could weigh in here - there is a
good chance that adding a new CPU element to the DT at this point
isn't supported at all.


Now, in other news ....

>
> 3) Work around in libvirt by having it issue a system_reset after the
> hotplugs.

... talking with Mike about this bug, he proposed a solution similar
to (3), but with a CAS-induced reset: if there are hotplugged devices
at that point, set spapr->cas_reboot to true. With that, QEMU reboots
automatically and the hotplugged devices are treated as coldplugged.
It works.

I am mentioning this because it is a simpler solution than relying on
libvirt to issue a system_reset. It is also a solution I would like
to avoid, though - I am not looking forward to bug reports complaining
about the machine rebooting in the middle of boot. I will exhaust my
attempts at (2) before proposing this CAS reboot solution.
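
For illustration only, a standalone model of the CAS-reboot idea
(hypothetical flag and flow; the real change would key off
spapr->cas_reboot in the CAS path):

    #include <stdio.h>
    #include <stdbool.h>

    static bool cas_reboot;
    static int hotplugged_before_cas;

    /* If anything was hotplugged before CAS completed, request a
     * reboot so those devices come back as coldplugged on the next
     * boot. */
    static void client_architecture_support(void)
    {
        if (hotplugged_before_cas) {
            cas_reboot = true;
        }
        if (cas_reboot) {
            printf("CAS: rebooting, %d device(s) become coldplugged\n",
                   hotplugged_before_cas);
            cas_reboot = false;
            hotplugged_before_cas = 0;  /* coldplugged after reset */
        } else {
            printf("CAS: continuing boot\n");
        }
    }

    int main(void)
    {
        hotplugged_before_cas = 1;       /* device_add before boot */
        client_architecture_support();   /* triggers the reboot    */
        client_architecture_support();   /* second boot: clean     */
        return 0;
    }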


Thanks,


Daniel







Re: [Qemu-devel] [Qemu-ppc] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by David Gibson 6 years, 7 months ago
On Tue, Aug 22, 2017 at 08:50:44PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 08/18/2017 01:14 AM, David Gibson wrote:
> > On Thu, Aug 17, 2017 at 06:31:28PM -0300, Daniel Henrique Barboza wrote:
> [...]
> > 
> > 3) Work around in libvirt by having it issue a system_reset after the
> > hotplugs.
> 
> ... talking with Mike about this bug, he proposed a solution similar
> to (3), but with a CAS-induced reset: if there are hotplugged devices
> at that point, set spapr->cas_reboot to true. With that, QEMU reboots
> automatically and the hotplugged devices are treated as coldplugged.
> It works.
> 
> I am mentioning this because it is a simpler solution than relying on
> libvirt to issue a system_reset. It is also a solution I would like
> to avoid, though - I am not looking forward to bug reports complaining
> about the machine rebooting in the middle of boot. I will exhaust my
> attempts at (2) before proposing this CAS reboot solution.

Ah!  Actually I think that's a great idea.  I mean, it's an ugly hack,
but it should also get us something that more or less works with
*much* less work than other approaches.  So I think it's a good
interim solution.

Do you have a patch?  It's probably too late for qemu 2.10, but if we
can get it in as soon as 2.11 opens, then we should be able to get it
into the RH releases at least, which will be very helpful.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH] hw/ppc: disable hotplug before CAS is completed
Posted by Daniel Henrique Barboza 6 years, 7 months ago

On 08/22/2017 09:24 PM, David Gibson wrote:
> On Tue, Aug 22, 2017 at 08:50:44PM -0300, Daniel Henrique Barboza wrote:
> [...]
> Ah!  Actually I think that's a great idea.  I mean, it's an ugly hack,
> but it should also get us something that more or less works with
> *much* less work than other approaches.  So I think it's a good
> interim solution.
Alright then!

>
> Do you have a patch?  It's probably too late for qemu 2.10, but if we
> can get it in as soon as 2.11 opens, then we should be able to get it
> into the RH releases at least, which will be very helpful.

I'll test the code with PCI hotplug (I have tested only with logical
DRCs so far), make changes if required, and send it.


Daniel
