[Qemu-devel] [PATCH 0/5] spapr: DRC cleanups (part VI)

Posted by David Gibson 6 years, 9 months ago
This sixth set of DRC cleanup patches (to be applied on top of the
patches from part V) is a complete rework of DRC state management.  We
stop tracking some unnecessary things, and change the basic state
representation to a simpler and more robust model.
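
As a concrete illustration of where this lands, here is a sketch of the
single consolidated state variable the series moves toward. The names
below are assumptions taken from later QEMU sources
(include/hw/ppc/spapr_drc.h), not a quote from this posting:

/* Assumed from later QEMU sources: one state variable replacing the
 * old combination of isolation/allocation/configured fields. */
typedef enum {
    SPAPR_DRC_STATE_INVALID             = 0,
    SPAPR_DRC_STATE_LOGICAL_UNUSABLE    = 1,
    SPAPR_DRC_STATE_LOGICAL_AVAILABLE   = 2,
    SPAPR_DRC_STATE_LOGICAL_UNISOLATE   = 3,
    SPAPR_DRC_STATE_LOGICAL_CONFIGURED  = 4,
    SPAPR_DRC_STATE_PHYSICAL_AVAILABLE  = 5,
    SPAPR_DRC_STATE_PHYSICAL_POWERON    = 6,
    SPAPR_DRC_STATE_PHYSICAL_UNISOLATE  = 7,
    SPAPR_DRC_STATE_PHYSICAL_CONFIGURED = 8,
} SpaprDrcState;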

Most of the patches in this set "break" migration from earlier git
snapshots, but not from any released qemu version.  The previous
migration stream format had multiple problems, so better to fix them
now, before 2.10 is out.

David Gibson (5):
  spapr: Remove 'awaiting_allocation' DRC flag
  spapr: Refactor spapr_drc_detach()
  spapr: Cleanups relating to DRC awaiting_release field
  spapr: Consolidate DRC state variables
  spapr: Remove sPAPRConfigureConnectorState sub-structure

 hw/ppc/spapr.c             |   9 +-
 hw/ppc/spapr_drc.c         | 321 +++++++++++++++++++++------------------------
 hw/ppc/spapr_pci.c         |  13 +-
 hw/ppc/trace-events        |   3 +-
 include/hw/ppc/spapr_drc.h |  53 +++++---
 5 files changed, 187 insertions(+), 212 deletions(-)

-- 
2.9.4


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Daniel Henrique Barboza 6 years, 9 months ago
I just tested this patch set on top of the current ppc-for-2.10 branch
(which contains the patches from part V). It applied cleanly but
required a couple of trivial fixes to build, probably because it was
made on top of an older code base.

The trivial migration test worked fine. The libvirt scenario (attaching
a device on the target before migration, then trying to unplug after
migration) isn't working as expected, but we get a different result with
this series. Instead of silently failing to unplug (with error messages
in dmesg), the hot unplug works at the QEMU level:

(qemu) device_del core1
(qemu)
(qemu) info cpus
* CPU #0: nip=0xc0000000000a3e0c thread_id=86162
(qemu) info hotpluggable-cpus
Hotpluggable CPUs:
   type: "host-spapr-cpu-core"
   vcpus_count: "1"
   CPUInstance Properties:
     core-id: "3"
   type: "host-spapr-cpu-core"
   vcpus_count: "1"
   CPUInstance Properties:
     core-id: "2"
   type: "host-spapr-cpu-core"
   vcpus_count: "1"
   CPUInstance Properties:
     core-id: "1"
   type: "host-spapr-cpu-core"
   vcpus_count: "1"
   qom_path: "/machine/unattached/device[0]"
   CPUInstance Properties:
     core-id: "0"
(qemu)


However, any operation on the guest afterwards (tried with lscpu and
dmesg) seems to hang the guest. This is what I got when trying to run
dmesg after the hot unplug:

danielhb@ubuntu1704:~$
danielhb@ubuntu1704:~$
danielhb@ubuntu1704:~$ dmesg | tail -n 5
^C

[  362.749693] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
[  362.749819]       Not tainted 4.10.0-26-generic #30-Ubuntu
[  362.749892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  362.750224] INFO: task kworker/0:3:1887 blocked for more than 120 seconds.
[  362.750320]       Not tainted 4.10.0-26-generic #30-Ubuntu
[  362.750394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.568842] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
[  483.568997]       Not tainted 4.10.0-26-generic #30-Ubuntu
[  483.569067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.569364] INFO: task kworker/0:3:1887 blocked for more than 120 seconds.
(...)


I am not sure if this scenario was intended to work with this patch set
already. Figured it can serve as an FYI for the next series.

Given that hotplug/unplug without migration still works, and on the
assumption that we'll look more into this libvirt scenario in the next
spins, +1.


Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>



On 06/21/2017 06:18 AM, David Gibson wrote:
> This sixth set of DRC cleanup patches (to be applied on top of the
> patches from part V) is a complete rework of DRC state management.  We
> stop tracking some unnecessary things, and change the basic state
> representation to a simpler and more robust model.
>
> Most of the patches in this set "break" migration from earlier git
> snapshots, but not from any released qemu version.  The previous
> migration stream format had multiple problems, so better to fix them
> now, before 2.10 is out.
>
> David Gibson (5):
>    spapr: Remove 'awaiting_allocation' DRC flag
>    spapr: Refactor spapr_drc_detach()
>    spapr: Cleanups relating to DRC awaiting_release field
>    spapr: Consolidate DRC state variables
>    spapr: Remove sPAPRConfigureConnectorState sub-structure
>
>   hw/ppc/spapr.c             |   9 +-
>   hw/ppc/spapr_drc.c         | 321 +++++++++++++++++++++------------------------
>   hw/ppc/spapr_pci.c         |  13 +-
>   hw/ppc/trace-events        |   3 +-
>   include/hw/ppc/spapr_drc.h |  53 +++++---
>   5 files changed, 187 insertions(+), 212 deletions(-)
>


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by David Gibson 6 years, 9 months ago
On Tue, Jul 04, 2017 at 06:13:31PM -0300, Daniel Henrique Barboza wrote:
> I just tested this patch set on top of current ppc-for-2.10 branch (which
> contains
> the patches from part V). It applied cleanly but required a couple of
> trivial
> fixes to build probably because it was made on top of an older code base.

Right, I fixed that up locally already, but haven't gotten around to
reposting yet.  You can look at the 'drcVI' branch on my github tree,
if you're interested.

> The trivial migration test worked fine. The libvirt scenario (attaching a
> device on
> target before migration, try to unplug after migration) isn't working as
> expected
> but we have a different result with this series. Instead of silently failing
> to unplug
> with error messages on dmesg, the hot unplug works on QEMU level:

Thanks for testing.  Just to clarify what you're saying here, you
haven't spotted a regression with this series, but there is a case
which was broken and is still broken with slightly different
symptoms.  Yes?

> 
> (qemu) device_del core1
> (qemu)
> (qemu) info cpus
> * CPU #0: nip=0xc0000000000a3e0c thread_id=86162
> (qemu) info hotpluggable-cpus
> Hotpluggable CPUs:
>   type: "host-spapr-cpu-core"
>   vcpus_count: "1"
>   CPUInstance Properties:
>     core-id: "3"
>   type: "host-spapr-cpu-core"
>   vcpus_count: "1"
>   CPUInstance Properties:
>     core-id: "2"
>   type: "host-spapr-cpu-core"
>   vcpus_count: "1"
>   CPUInstance Properties:
>     core-id: "1"
>   type: "host-spapr-cpu-core"
>   vcpus_count: "1"
>   qom_path: "/machine/unattached/device[0]"
>   CPUInstance Properties:
>     core-id: "0"
> (qemu)
> 
> 
> However, any operation on the guest afterwards (tried with lscpu and dmesg)
> seems
> to hung the guest. This is what I got when trying to do a dmesg after the
> hot unplug:

Ouch.  That's bad.  I'll have to look into it.

I have rather a lot on my plate at the moment - if you get a chance to
work out which of the patches in the series causes this behaviour,
that could be handy.

> danielhb@ubuntu1704:~$
> danielhb@ubuntu1704:~$
> danielhb@ubuntu1704:~$ dmesg | tail -n 5
> ^C
> 
> [  362.749693] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
> [  362.749819]       Not tainted 4.10.0-26-generic #30-Ubuntu
> [  362.749892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  362.750224] INFO: task kworker/0:3:1887 blocked for more than 120
> seconds.
> [  362.750320]       Not tainted 4.10.0-26-generic #30-Ubuntu
> [  362.750394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  483.568842] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
> [  483.568997]       Not tainted 4.10.0-26-generic #30-Ubuntu
> [  483.569067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  483.569364] INFO: task kworker/0:3:1887 blocked for more than 120
> seconds.
> (...)
> 
> 
> I am not sure if it was intended for this scenario to work with this patch
> set already. Figured
> it can serve as a FYI for the next series.

This might be due to a problem I've discussed previously with
Laurent.  libvirt uses device_add on the destination's monitor before
sending the migration stream, rather than starting the destination
with already-hotplugged devices as -device.

To my surprise, there doesn't seem to be a system reset between that
device_add phase and processing the incoming stream, which means DRCs
for those devices might be in the wrong state.

I might need to explicitly trigger a reset of DRC state at incoming
migration time to handle this.

> Given that hotplug/unplug without migration still works and on the
> assumption that
> we'll look more into this libvirt scenario in the next spins, +1.
> 
> 
> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
> 
> 
> 
> On 06/21/2017 06:18 AM, David Gibson wrote:
> > This sixth set of DRC cleanup patches (to be applied on top of the
> > patches from part V) is a complete rework of DRC state management.  We
> > stop tracking some unnecessary things, and change the basic state
> > representation to a simpler and more robust model.
> > 
> > Most of the patches in this set "break" migration from earlier git
> > snapshots, but not from any released qemu version.  The previous
> > migration stream format had multiple problems, so better to fix them
> > now, before 2.10 is out.
> > 
> > David Gibson (5):
> >    spapr: Remove 'awaiting_allocation' DRC flag
> >    spapr: Refactor spapr_drc_detach()
> >    spapr: Cleanups relating to DRC awaiting_release field
> >    spapr: Consolidate DRC state variables
> >    spapr: Remove sPAPRConfigureConnectorState sub-structure
> > 
> >   hw/ppc/spapr.c             |   9 +-
> >   hw/ppc/spapr_drc.c         | 321 +++++++++++++++++++++------------------------
> >   hw/ppc/spapr_pci.c         |  13 +-
> >   hw/ppc/trace-events        |   3 +-
> >   include/hw/ppc/spapr_drc.h |  53 +++++---
> >   5 files changed, 187 insertions(+), 212 deletions(-)
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Daniel Henrique Barboza 6 years, 9 months ago

On 07/05/2017 08:04 AM, David Gibson wrote:
> On Tue, Jul 04, 2017 at 06:13:31PM -0300, Daniel Henrique Barboza wrote:
>> I just tested this patch set on top of current ppc-for-2.10 branch (which
>> contains
>> the patches from part V). It applied cleanly but required a couple of
>> trivial
>> fixes to build probably because it was made on top of an older code base.
> Right, I fixed that up locally already, but haven't gotten around to
> reposting yet.  You can look at the 'drcVI' branch on my github tree,
> if you're interested.
>
>> The trivial migration test worked fine. The libvirt scenario (attaching a
>> device on
>> target before migration, try to unplug after migration) isn't working as
>> expected
>> but we have a different result with this series. Instead of silently failing
>> to unplug
>> with error messages on dmesg, the hot unplug works on QEMU level:
> Thanks for testing.  Just to clarify what you're saying here, you
> haven't spotted a regression with this series, but there is a case
> which was broken and is still broken with slightly different
> symptoms.  Yes?
In my opinion, yes. It is debatable whether the patch series made it
worse, because the guest is now misbehaving, but the feature per se
wasn't working prior to it.

>
>> (qemu) device_del core1
>> (qemu)
>> (qemu) info cpus
>> * CPU #0: nip=0xc0000000000a3e0c thread_id=86162
>> (qemu) info hotpluggable-cpus
>> Hotpluggable CPUs:
>>    type: "host-spapr-cpu-core"
>>    vcpus_count: "1"
>>    CPUInstance Properties:
>>      core-id: "3"
>>    type: "host-spapr-cpu-core"
>>    vcpus_count: "1"
>>    CPUInstance Properties:
>>      core-id: "2"
>>    type: "host-spapr-cpu-core"
>>    vcpus_count: "1"
>>    CPUInstance Properties:
>>      core-id: "1"
>>    type: "host-spapr-cpu-core"
>>    vcpus_count: "1"
>>    qom_path: "/machine/unattached/device[0]"
>>    CPUInstance Properties:
>>      core-id: "0"
>> (qemu)
>>
>>
>> However, any operation on the guest afterwards (tried with lscpu and dmesg)
>> seems
>> to hung the guest. This is what I got when trying to do a dmesg after the
>> hot unplug:
> Ouch.  That's bad.  I'll have to look into it.
>
> I have rather a lot on my plate at the moment - if you get a chance to
> work out which of the patches in the series causes this behaviour,
> that could be handy.

***long post warning***

With current master and the current HEAD of ppc-for-2.10, the behavior
in the libvirt scenario (device_add on both source and target before
migration, hot unplug after migration is completed) is that QEMU fails
to hot unplug the CPU from the guest OS. lscpu reports the same number
of CPUs even after device_del, and dmesg shows an error like this:

[  108.182291] pseries-hotplug-cpu: CPU with drc index 10000008 already exists


With this patch series, QEMU removes the CPU but the guest misbehaves as
I described in my previous message.

With the current HEAD of the drcVI branch, I rolled back until I found
the patch causing this new symptom. It is the very first patch of the
series:

b752844 spapr: Remove 'awaiting_allocation' DRC flag

In short, adding this single patch on top of the HEAD of ppc-for-2.10
causes the new behavior I saw in my tests. This is the kernel log
retrieved after the failed unplug:

[  176.434099] random: crng init done
[  461.182729] pseries-hotplug-cpu: CPU with drc index 10000008 already exists
[  604.707369] INFO: task kworker/0:2:920 blocked for more than 120 seconds.
[  604.707666]       Not tainted 4.10.0-26-generic #30-Ubuntu
[  604.707881] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  604.708194] kworker/0:2     D    0   920      2 0x00000800
[  604.708248] Workqueue: events vmstat_shepherd
[  604.708251] Call Trace:
[  604.708265] [c0000000055a7830] [c000000001492090] sysctl_sched_migration_cost+0x0/0x4 (unreliable)
[  604.708271] [c0000000055a7a00] [c00000000001b770] __switch_to+0x2c0/0x450
[  604.708286] [c0000000055a7a60] [c000000000b51238] __schedule+0x2f8/0x990
[  604.708289] [c0000000055a7b40] [c000000000b51918] schedule+0x48/0xc0
[  604.708292] [c0000000055a7b70] [c000000000b51e40] schedule_preempt_disabled+0x20/0x30
[  604.708295] [c0000000055a7b90] [c000000000b54598] __mutex_lock_slowpath+0x208/0x380
[  604.708310] [c0000000055a7c10] [c0000000000e2ec8] get_online_cpus+0x58/0xa0
[  604.708312] [c0000000055a7c40] [c0000000002a5a18] vmstat_shepherd+0x38/0x160
[  604.708316] [c0000000055a7c90] [c0000000001061a0] process_one_work+0x2b0/0x5a0
[  604.708319] [c0000000055a7d20] [c000000000106538] worker_thread+0xa8/0x650
[  604.708322] [c0000000055a7dc0] [c00000000010f0a4] kthread+0x164/0x1b0
[  604.708326] [c0000000055a7e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[  604.708341] INFO: task kworker/u8:0:3068 blocked for more than 120 seconds.
[  604.708601]       Not tainted 4.10.0-26-generic #30-Ubuntu
[  604.708809] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  604.709107] kworker/u8:0    D    0  3068      2 0x00000800
[  604.709114] Workqueue: pseries hotplug workque pseries_hp_work_fn
[  604.709116] Call Trace:
[  604.709118] [c00000000589b3c0] [000fffffffffffff] 0xfffffffffffff (unreliable)
[  604.709122] [c00000000589b590] [c00000000001b770] __switch_to+0x2c0/0x450
[  604.709125] [c00000000589b5f0] [c000000000b51238] __schedule+0x2f8/0x990
[  604.709128] [c00000000589b6d0] [c000000000b51918] schedule+0x48/0xc0
[  604.709131] [c00000000589b700] [c000000000b566b4] schedule_timeout+0x274/0x470
[  604.709134] [c00000000589b7f0] [c000000000b528ec] wait_for_common+0xec/0x250
[  604.709137] [c00000000589b870] [c0000000000e3320] cpuhp_kick_ap_work+0x90/0x210
[  604.709140] [c00000000589b8d0] [c00000000026f57c] _cpu_down+0xdc/0x1b0
[  604.709143] [c00000000589b930] [c0000000000e44b4] do_cpu_down+0x64/0xb0
[  604.709157] [c00000000589b970] [c0000000007541b4] cpu_subsys_offline+0x24/0x40
[  604.709160] [c00000000589b990] [c00000000074bc04] device_offline+0xf4/0x130
[  604.709163] [c00000000589b9d0] [c0000000000ae4bc] dlpar_cpu_remove+0x24c/0x350
[  604.709166] [c00000000589bab0] [c0000000000ae6bc] dlpar_cpu_remove_by_index+0xfc/0x130
[  604.709169] [c00000000589bb40] [c0000000000af838] dlpar_cpu+0x78/0x540
[  604.709172] [c00000000589bbf0] [c0000000000a98a0] handle_dlpar_errorlog+0xf0/0x170
[  604.709174] [c00000000589bc60] [c0000000000a99b4] pseries_hp_work_fn+0x94/0xa0
[  604.709177] [c00000000589bc90] [c0000000001061a0] process_one_work+0x2b0/0x5a0
[  604.709180] [c00000000589bd20] [c000000000106538] worker_thread+0xa8/0x650
[  604.709183] [c00000000589bdc0] [c00000000010f0a4] kthread+0x164/0x1b0
[  604.709186] [c00000000589be30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74


Digging around in the kernel code reported in the trace and reading the
commit message of the patch that causes it, my first guess is that there
is some confusion left in the DRC state management which, until now, was
being masked by the 'awaiting_allocation' flag. One potentially relevant
use of this flag that was removed is this piece of code:

(inside spapr_drc_detach)

-    if (drc->awaiting_allocation) {
-        drc->awaiting_release = true;
-        trace_spapr_drc_awaiting_allocation(spapr_drc_index(drc));
-        return;
-    }


Speculating a little, perhaps the guest is trying to release a DRC that
isn't ready to be released yet (probably in an inconsistent state) and
this code was preventing it from doing so. I'll know more when I dig
further.


>
>> danielhb@ubuntu1704:~$
>> danielhb@ubuntu1704:~$
>> danielhb@ubuntu1704:~$ dmesg | tail -n 5
>> ^C
>>
>> [  362.749693] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
>> [  362.749819]       Not tainted 4.10.0-26-generic #30-Ubuntu
>> [  362.749892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [  362.750224] INFO: task kworker/0:3:1887 blocked for more than 120
>> seconds.
>> [  362.750320]       Not tainted 4.10.0-26-generic #30-Ubuntu
>> [  362.750394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [  483.568842] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
>> [  483.568997]       Not tainted 4.10.0-26-generic #30-Ubuntu
>> [  483.569067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [  483.569364] INFO: task kworker/0:3:1887 blocked for more than 120
>> seconds.
>> (...)
>>
>>
>> I am not sure if it was intended for this scenario to work with this patch
>> set already. Figured
>> it can serve as a FYI for the next series.
> This might be due to a problem I've discussed previously with
> Laurent.  libvirt uses device_add on the destination's monitor before
> sending the migration stream, rather than starting the destination
> with already-hotplugged devices as -device.
>
> To my surprise, there doesn't seem to be a system reset between that
> device_add phase and processing the incoming stream, which means DRCs
> for those devices might be in the wrong state.
>
> I might need to explicitly trigger a reset of DRC state at incoming
> migration time to handle this.

Worth a try. I'll see if I can make a POC of this DRC reset at incoming
migration time.


Daniel

>
>> Given that hotplug/unplug without migration still works and on the
>> assumption that
>> we'll look more into this libvirt scenario in the next spins, +1.
>>
>>
>> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
>>
>>
>>
>> On 06/21/2017 06:18 AM, David Gibson wrote:
>>> This sixth set of DRC cleanup patches (to be applied on top of the
>>> patches from part V) is a complete rework of DRC state management.  We
>>> stop tracking some unnecessary things, and change the basic state
>>> representation to a simpler and more robust model.
>>>
>>> Most of the patches in this set "break" migration from earlier git
>>> snapshots, but not from any released qemu version.  The previous
>>> migration stream format had multiple problems, so better to fix them
>>> now, before 2.10 is out.
>>>
>>> David Gibson (5):
>>>     spapr: Remove 'awaiting_allocation' DRC flag
>>>     spapr: Refactor spapr_drc_detach()
>>>     spapr: Cleanups relating to DRC awaiting_release field
>>>     spapr: Consolidate DRC state variables
>>>     spapr: Remove sPAPRConfigureConnectorState sub-structure
>>>
>>>    hw/ppc/spapr.c             |   9 +-
>>>    hw/ppc/spapr_drc.c         | 321 +++++++++++++++++++++------------------------
>>>    hw/ppc/spapr_pci.c         |  13 +-
>>>    hw/ppc/trace-events        |   3 +-
>>>    include/hw/ppc/spapr_drc.h |  53 +++++---
>>>    5 files changed, 187 insertions(+), 212 deletions(-)
>>>


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Michael Roth 6 years, 9 months ago
Quoting Daniel Henrique Barboza (2017-07-05 16:53:57)
> 
> 
> On 07/05/2017 08:04 AM, David Gibson wrote:
> > On Tue, Jul 04, 2017 at 06:13:31PM -0300, Daniel Henrique Barboza wrote:
> >> I just tested this patch set on top of current ppc-for-2.10 branch (which
> >> contains
> >> the patches from part V). It applied cleanly but required a couple of
> >> trivial
> >> fixes to build probably because it was made on top of an older code base.
> > Right, I fixed that up locally already, but haven't gotten around to
> > reposting yet.  You can look at the 'drcVI' branch on my github tree,
> > if you're interested.
> >
> >> The trivial migration test worked fine. The libvirt scenario (attaching a
> >> device on
> >> target before migration, try to unplug after migration) isn't working as
> >> expected
> >> but we have a different result with this series. Instead of silently failing
> >> to unplug
> >> with error messages on dmesg, the hot unplug works on QEMU level:
> > Thanks for testing.  Just to clarify what you're saying here, you
> > haven't spotted a regression with this series, but there is a case
> > which was broken and is still broken with slightly different
> > symptoms.  Yes?
> In my opinion, yes. It is debatable if the patch series made it worse 
> because
> the guest is now misbehaving, but the feature per se wasn't working
> prior to it.

I think it's the removal of awaiting_allocation. We know currently that
in the libvirt scenario the DRC is exposed in a pre-hotplug state of
ISOLATED/UNALLOCATED. In that state, spapr_drc_detach() completes
immediately, because from the perspective of QEMU the device apparently
has not been exposed to the guest yet, or the guest has already quiesced
it on its end.

awaiting_allocation guarded against this: its intention was to make sure
that the resource was put into an ALLOCATED state prior to getting moved
back into an UNALLOCATED state, so we didn't immediately unplug a CPU
while the hotplug was in progress.

So in your scenario the CPU is just mysteriously vanishing out from
under the guest, which probably explains the hang. The fix for this
particular scenario is to fix the initial DRC state on the target. I
think we have a plan for that, so I wouldn't consider this a regression
necessarily.
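
For reference, a minimal sketch of the detach fast path described above;
the shape of spapr_drc_detach() here is assumed from later QEMU sources,
not quoted from this series:

/* Hedged sketch (assumed from later QEMU sources, not this series):
 * when the DRC is already in its "empty" state -- e.g. a logical DRC
 * in UNUSABLE that the guest never touched -- detach releases the
 * device immediately.  That is the fast path that makes the CPU
 * vanish out from under the guest in the libvirt scenario. */
void spapr_drc_detach(sPAPRDRConnector *drc)
{
    sPAPRDRConnectorClass *drck = SPAPR_DR_CONNECTOR_GET_CLASS(drc);

    g_assert(drc->dev);

    drc->unplug_requested = true;

    if (drc->state != drck->empty_state) {
        /* Guest still holds the resource; wait for it to quiesce. */
        trace_spapr_drc_awaiting_quiesce(spapr_drc_index(drc));
        return;
    }

    spapr_drc_release(drc);
}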

> 
> >
> >> (qemu) device_del core1
> >> (qemu)
> >> (qemu) info cpus
> >> * CPU #0: nip=0xc0000000000a3e0c thread_id=86162
> >> (qemu) info hotpluggable-cpus
> >> Hotpluggable CPUs:
> >>    type: "host-spapr-cpu-core"
> >>    vcpus_count: "1"
> >>    CPUInstance Properties:
> >>      core-id: "3"
> >>    type: "host-spapr-cpu-core"
> >>    vcpus_count: "1"
> >>    CPUInstance Properties:
> >>      core-id: "2"
> >>    type: "host-spapr-cpu-core"
> >>    vcpus_count: "1"
> >>    CPUInstance Properties:
> >>      core-id: "1"
> >>    type: "host-spapr-cpu-core"
> >>    vcpus_count: "1"
> >>    qom_path: "/machine/unattached/device[0]"
> >>    CPUInstance Properties:
> >>      core-id: "0"
> >> (qemu)
> >>
> >>
> >> However, any operation on the guest afterwards (tried with lscpu and dmesg)
> >> seems
> >> to hung the guest. This is what I got when trying to do a dmesg after the
> >> hot unplug:
> > Ouch.  That's bad.  I'll have to look into it.
> >
> > I have rather a lot on my plate at the moment - if you get a chance to
> > work out which of the patches in the series causes this behaviour,
> > that could be handy.
> 
> ***long post warning***
> 
> With the current master and current HEAD of ppc-for-2.10, the behavior in
> the Libvirt scenario (device_add in both source and target before migration,
> hot unplug after migration is completed) is that QEMU fails to hot 
> unplug the
> CPU from the guest OS. lscpu reports the same # of cpus even after 
> device_del,
> dmesg shows an error like this:
> 
> [  108.182291] pseries-hotplug-cpu: CPU with drc index 10000008 already 
> exists
> 
> 
> WIth this patch series, QEMU removes the CPU but the guest misbehaves 
> like I said
> in my previous message.
> 
> With the current HEAD of drcVI branch, I rolled back until I found the 
> patch that was
> causing this new symptom. The patch is the very first of the series:
> 
> b752844 spapr: Remove 'awaiting_allocation' DRC flag
> 
> In short, adding this single patch into the HEAD of ppc-for-2.10 is 
> causing this new
> behavior I saw in my tests. This is the retrieved kernel log after the 
> failed unplug:
> 
> [  176.434099] random: crng init done
> [  461.182729] pseries-hotplug-cpu: CPU with drc index 10000008 already 
> exists
> [  604.707369] INFO: task kworker/0:2:920 blocked for more than 120 seconds.
> [  604.707666]       Not tainted 4.10.0-26-generic #30-Ubuntu
> [  604.707881] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [  604.708194] kworker/0:2     D    0   920      2 0x00000800
> [  604.708248] Workqueue: events vmstat_shepherd
> [  604.708251] Call Trace:
> [  604.708265] [c0000000055a7830] [c000000001492090] 
> sysctl_sched_migration_cost+0x0/0x4 (unreliable)
> [  604.708271] [c0000000055a7a00] [c00000000001b770] __switch_to+0x2c0/0x450
> [  604.708286] [c0000000055a7a60] [c000000000b51238] __schedule+0x2f8/0x990
> [  604.708289] [c0000000055a7b40] [c000000000b51918] schedule+0x48/0xc0
> [  604.708292] [c0000000055a7b70] [c000000000b51e40] 
> schedule_preempt_disabled+0x20/0x30
> [  604.708295] [c0000000055a7b90] [c000000000b54598] 
> __mutex_lock_slowpath+0x208/0x380
> [  604.708310] [c0000000055a7c10] [c0000000000e2ec8] 
> get_online_cpus+0x58/0xa0
> [  604.708312] [c0000000055a7c40] [c0000000002a5a18] 
> vmstat_shepherd+0x38/0x160
> [  604.708316] [c0000000055a7c90] [c0000000001061a0] 
> process_one_work+0x2b0/0x5a0
> [  604.708319] [c0000000055a7d20] [c000000000106538] 
> worker_thread+0xa8/0x650
> [  604.708322] [c0000000055a7dc0] [c00000000010f0a4] kthread+0x164/0x1b0
> [  604.708326] [c0000000055a7e30] [c00000000000b4e8] 
> ret_from_kernel_thread+0x5c/0x74
> [  604.708341] INFO: task kworker/u8:0:3068 blocked for more than 120 
> seconds.
> [  604.708601]       Not tainted 4.10.0-26-generic #30-Ubuntu
> [  604.708809] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [  604.709107] kworker/u8:0    D    0  3068      2 0x00000800
> [  604.709114] Workqueue: pseries hotplug workque pseries_hp_work_fn
> [  604.709116] Call Trace:
> [  604.709118] [c00000000589b3c0] [000fffffffffffff] 0xfffffffffffff 
> (unreliable)
> [  604.709122] [c00000000589b590] [c00000000001b770] __switch_to+0x2c0/0x450
> [  604.709125] [c00000000589b5f0] [c000000000b51238] __schedule+0x2f8/0x990
> [  604.709128] [c00000000589b6d0] [c000000000b51918] schedule+0x48/0xc0
> [  604.709131] [c00000000589b700] [c000000000b566b4] 
> schedule_timeout+0x274/0x470
> [  604.709134] [c00000000589b7f0] [c000000000b528ec] 
> wait_for_common+0xec/0x250
> [  604.709137] [c00000000589b870] [c0000000000e3320] 
> cpuhp_kick_ap_work+0x90/0x210
> [  604.709140] [c00000000589b8d0] [c00000000026f57c] _cpu_down+0xdc/0x1b0
> [  604.709143] [c00000000589b930] [c0000000000e44b4] do_cpu_down+0x64/0xb0
> [  604.709157] [c00000000589b970] [c0000000007541b4] 
> cpu_subsys_offline+0x24/0x40
> [  604.709160] [c00000000589b990] [c00000000074bc04] 
> device_offline+0xf4/0x130
> [  604.709163] [c00000000589b9d0] [c0000000000ae4bc] 
> dlpar_cpu_remove+0x24c/0x350
> [  604.709166] [c00000000589bab0] [c0000000000ae6bc] 
> dlpar_cpu_remove_by_index+0xfc/0x130
> [  604.709169] [c00000000589bb40] [c0000000000af838] dlpar_cpu+0x78/0x540
> [  604.709172] [c00000000589bbf0] [c0000000000a98a0] 
> handle_dlpar_errorlog+0xf0/0x170
> [  604.709174] [c00000000589bc60] [c0000000000a99b4] 
> pseries_hp_work_fn+0x94/0xa0
> [  604.709177] [c00000000589bc90] [c0000000001061a0] 
> process_one_work+0x2b0/0x5a0
> [  604.709180] [c00000000589bd20] [c000000000106538] 
> worker_thread+0xa8/0x650
> [  604.709183] [c00000000589bdc0] [c00000000010f0a4] kthread+0x164/0x1b0
> [  604.709186] [c00000000589be30] [c00000000000b4e8] 
> ret_from_kernel_thread+0x5c/0x74
> 
> 
> Digging around in the kernel code reported in the trace and reading the 
> commit message
> of the patch that is causing it, my first guess is that there is some 
> confusion left
> in the DRC state management that, at this moment, is being fixed by the
> 'awaiting_allocation' flag. One potentially relevant use of this flag 
> that was removed was this
> piece of code:
> 
> (inside spapr_drc_detach)
> 
> -    if (drc->awaiting_allocation) {
> -        drc->awaiting_release = true;
> -        trace_spapr_drc_awaiting_allocation(spapr_drc_index(drc));
> -        return;
> -    }
> 
> 
> Speculating a little, perhaps the guest is trying to release a DRC that 
> isn't ready to be released
> yet (inconsistent state probably) and this code was preventing it from 
> doing so. I'll
> know more when I dig further.
> 
> 
> >
> >> danielhb@ubuntu1704:~$
> >> danielhb@ubuntu1704:~$
> >> danielhb@ubuntu1704:~$ dmesg | tail -n 5
> >> ^C
> >>
> >> [  362.749693] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
> >> [  362.749819]       Not tainted 4.10.0-26-generic #30-Ubuntu
> >> [  362.749892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> >> this message.
> >> [  362.750224] INFO: task kworker/0:3:1887 blocked for more than 120
> >> seconds.
> >> [  362.750320]       Not tainted 4.10.0-26-generic #30-Ubuntu
> >> [  362.750394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> >> this message.
> >> [  483.568842] INFO: task kworker/u8:1:30 blocked for more than 120 seconds.
> >> [  483.568997]       Not tainted 4.10.0-26-generic #30-Ubuntu
> >> [  483.569067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> >> this message.
> >> [  483.569364] INFO: task kworker/0:3:1887 blocked for more than 120
> >> seconds.
> >> (...)
> >>
> >>
> >> I am not sure if it was intended for this scenario to work with this patch
> >> set already. Figured
> >> it can serve as a FYI for the next series.
> > This might be due to a problem I've discussed previously with
> > Laurent.  libvirt uses device_add on the destination's monitor before
> > sending the migration stream, rather than starting the destination
> > with already-hotplugged devices as -device.
> >
> > To my surprise, there doesn't seem to be a system reset between that
> > device_add phase and processing the incoming stream, which means DRCs
> > for those devices might be in the wrong state.
> >
> > I might need to explicitly trigger a reset of DRC state at incoming
> > migration time to handle this.
> 
> Worth a try. I'll see if I can make a POC of this DRC reset at incoming
> migration time.
> 
> 
> Daniel
> 
> >
> >> Given that hotplug/unplug without migration still works and on the
> >> assumption that
> >> we'll look more into this libvirt scenario in the next spins, +1.
> >>
> >>
> >> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
> >>
> >>
> >>
> >> On 06/21/2017 06:18 AM, David Gibson wrote:
> >>> This sixth set of DRC cleanup patches (to be applied on top of the
> >>> patches from part V) is a complete rework of DRC state management.  We
> >>> stop tracking some unnecessary things, and change the basic state
> >>> representation to a simpler and more robust model.
> >>>
> >>> Most of the patches in this set "break" migration from earlier git
> >>> snapshots, but not from any released qemu version.  The previous
> >>> migration stream format had multiple problems, so better to fix them
> >>> now, before 2.10 is out.
> >>>
> >>> David Gibson (5):
> >>>     spapr: Remove 'awaiting_allocation' DRC flag
> >>>     spapr: Refactor spapr_drc_detach()
> >>>     spapr: Cleanups relating to DRC awaiting_release field
> >>>     spapr: Consolidate DRC state variables
> >>>     spapr: Remove sPAPRConfigureConnectorState sub-structure
> >>>
> >>>    hw/ppc/spapr.c             |   9 +-
> >>>    hw/ppc/spapr_drc.c         | 321 +++++++++++++++++++++------------------------
> >>>    hw/ppc/spapr_pci.c         |  13 +-
> >>>    hw/ppc/trace-events        |   3 +-
> >>>    include/hw/ppc/spapr_drc.h |  53 +++++---
> >>>    5 files changed, 187 insertions(+), 212 deletions(-)
> >>>
> 


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by David Gibson 6 years, 9 months ago
On Wed, Jul 05, 2017 at 05:41:40PM -0500, Michael Roth wrote:
> Quoting Daniel Henrique Barboza (2017-07-05 16:53:57)
> > 
> > 
> > On 07/05/2017 08:04 AM, David Gibson wrote:
> > > On Tue, Jul 04, 2017 at 06:13:31PM -0300, Daniel Henrique Barboza wrote:
> > >> I just tested this patch set on top of current ppc-for-2.10 branch (which
> > >> contains
> > >> the patches from part V). It applied cleanly but required a couple of
> > >> trivial
> > >> fixes to build probably because it was made on top of an older code base.
> > > Right, I fixed that up locally already, but haven't gotten around to
> > > reposting yet.  You can look at the 'drcVI' branch on my github tree,
> > > if you're interested.
> > >
> > >> The trivial migration test worked fine. The libvirt scenario (attaching a
> > >> device on
> > >> target before migration, try to unplug after migration) isn't working as
> > >> expected
> > >> but we have a different result with this series. Instead of silently failing
> > >> to unplug
> > >> with error messages on dmesg, the hot unplug works on QEMU level:
> > > Thanks for testing.  Just to clarify what you're saying here, you
> > > haven't spotted a regression with this series, but there is a case
> > > which was broken and is still broken with slightly different
> > > symptoms.  Yes?
> > In my opinion, yes. It is debatable if the patch series made it worse 
> > because
> > the guest is now misbehaving, but the feature per se wasn't working
> > prior to it.
> 
> I think it's the removal of awaiting_allocation.

So.. yes, in the sense that I think we've rediscovered the problem
which prompted the awaiting_allocation flag in the first place.  I
still don't think awaiting_allocation is a sensible fix for.. well,
anything.

So we need to find the right fix for this problem.

> We know currently that
> in the libvirt scenario the DRC is exposed in an pre-hotplug state of
> ISOLATED/UNALLOCATED.

Right, need to understand exactly how we get there.

> In that state, spapr_drc_detach() completes
> immediately because from the perspective of QEMU it apparently has not
> been exposed to the guest yet, or the guest has already quiesced it on
> it's end.

Right.

> awaiting_allocation guarded against this, as it's intention was to make
> sure that resource was put into an ALLOCATED state prior to getting moved
> back into an UNALLOCATED state,

Right, but that seems broken to me.  It means that if a cpu is
hotplugged while the guest isn't paying attention (e.g. early boot,
guest is halted/crashed), then you can't remove it until you either
boot an OS that allocates then releases it, or you reset (and then
release).

> so we didn't immediately unplug a CPU
> while the hotplug was in progress.

But in what sense is the hotplug "in progress"?  The guest has been
given a notification, but it hasn't touched the device yet.  Moving
the state away from UNALLOCATED is never guaranteed to work,
regardless of what notifications have been received.

> So in your scenario the CPU is just mysteriously vanishing out from
> under the guest,

Because the DRC is ending up in UNUSABLE state on the destination when
it should be in some other state (and was on the source)?  Yeah, that
sounds plausible.

> which probably explains the hang. The fix for this
> particular scenario is to fix the initial DRC on the target. I think
> we have a plan for that so I wouldn't consider this a regression
> necessarily.

Yeah.. I'm not quite sure how it's ending up wrong; why isn't the
incoming migration state setting it correctly?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Daniel Henrique Barboza 6 years, 9 months ago

On 07/05/2017 06:53 PM, Daniel Henrique Barboza wrote:
> Worth a try. I'll see if I can make a POC of this DRC reset at incoming
> migration time. 


Just played a little with the idea of a manual reset during migration.
I've created a POC that resets the CPU DRCs in a pre_load hook. Then
I've done the same experiment - device_add on both source and target
before the migration, hot unplug the device after migration is
completed. The hot unplug worked as expected in both QEMU and the guest.

To minimize the impact I am resetting only the DRCs of the CPUs that
were hotplugged on the target, instead of resetting everything.

I'll see if this solution works for LMBs and PCI devices too. If so,
and if we are fine with this solution of resetting the DRCs in pre_load
(not sure if pre_load is the right place for doing it - suggestions
welcome), I can send a patch to be applied on top of this series.
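
As a rough illustration of the approach, here is a hypothetical sketch
of such a hook. This is not the actual POC: the iteration helper, the
spapr_drc_reset() call, and wiring the hook into the machine's
VMStateDescription are all assumptions, and for simplicity the sketch
resets every DRC rather than only the hotplugged CPUs' DRCs:

/* Hypothetical sketch, not the actual POC.  Reset every DRC before the
 * incoming stream is processed, so DRCs left half-attached by the
 * destination's device_add phase start from a clean base state; the
 * migrated DRC state then overwrites this where applicable. */
static int spapr_reset_drc_child(Object *child, void *opaque)
{
    sPAPRDRConnector *drc = (sPAPRDRConnector *)
        object_dynamic_cast(child, TYPE_SPAPR_DR_CONNECTOR);

    if (drc) {
        spapr_drc_reset(drc);  /* assumed helper: return DRC to base state */
    }
    return 0;  /* keep iterating over children */
}

static int spapr_pre_load(void *opaque)
{
    object_child_foreach_recursive(qdev_get_machine(),
                                   spapr_reset_drc_child, NULL);
    return 0;
}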



Thanks,

Daniel


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by David Gibson 6 years, 9 months ago
On Thu, Jul 06, 2017 at 11:31:35AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 07/05/2017 06:53 PM, Daniel Henrique Barboza wrote:
> > Worth a try. I'll see if I can make a POC of this DRC reset at incoming
> > migration time.
> 
> 
> Just played a little with the idea of manual reset during migration. I've
> created a
> POC that resets the CPU DRCs in a pre_load hook.

Ok.  I'm not sure, though: does the pre_load hook get called for
objects where nothing ends up being sent in the migration stream?

> Then I've done the same
> experiment - device_add on both source and target before the migration,
> hot unplug the device after migration is completed. The hot unplug worked as
> expected in both QEMU and guest.

That's encouraging.

> 
> To minimize the impact I am resetting only the DRCs of the CPUs that were
> hotplugged
> in the target instead of resetting everybody.

That sounds like more trouble than it's worth, I think it'll be easier
and safer to reset all DRCs.

> I'll see if this solution works for LMBs and PCI devices. In case
> affirmative, and
> if we are fine with this solution of resetting the DRCs in pre_load (not
> sure if pre_load is
> the right place for doing it - suggestions welcome), I can send a patch to
> be applied on top
> of this series.
> 
> 
> 
> Thanks,
> 
> Daniel
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Daniel Henrique Barboza 6 years, 9 months ago

On 07/07/2017 04:14 AM, David Gibson wrote:
> On Thu, Jul 06, 2017 at 11:31:35AM -0300, Daniel Henrique Barboza wrote:
>>
>> On 07/05/2017 06:53 PM, Daniel Henrique Barboza wrote:
>>> Worth a try. I'll see if I can make a POC of this DRC reset at incoming
>>> migration time.
>>
>> Just played a little with the idea of manual reset during migration. I've
>> created a
>> POC that resets the CPU DRCs in a pre_load hook.
> Ok.  I'm not sure does the pre_load hook get called for objects where
> nothing ends up being sent in the migration stream?

In this POC I've added a pre_load hook to the machine state to ensure
that it will be called.  I saw that we already have a post_load hook
there that adjusts CPU and radix state, so adding a pre_load hook
sounded viable too. I am not sure whether we can add some form of hook
for the DRCs that won't be migrated, or whether there is a better VMSD
in which to do the reset of the DRCs.

Anyway, this was more of an experiment than a serious fix proposal.
I believe the general idea is solid, but it will take a couple of spins
to get a patch that is good enough to push. I'll send the POC as an RFC
to get things started and we'll see what happens.


Daniel

>> Then I've done the same
>> experiment - device_add on both source and target before the migration,
>> hot unplug the device after migration is completed. The hot unplug worked as
>> expected in both QEMU and guest.
> That's encouraging.
>
>> To minimize the impact I am resetting only the DRCs of the CPUs that were
>> hotplugged
>> in the target instead of resetting everybody.
> That sounds like more trouble than it's worth, I think it'll be easier
> and safer to reset all DRCs.
>
>> I'll see if this solution works for LMBs and PCI devices. In case
>> affirmative, and
>> if we are fine with this solution of resetting the DRCs in pre_load (not
>> sure if pre_load is
>> the right place for doing it - suggestions welcome), I can send a patch to
>> be applied on top
>> of this series.
>>
>>
>>
>> Thanks,
>>
>> Daniel
>>


Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/5] spapr: DRC cleanups (part VI)
Posted by Daniel Henrique Barboza 6 years, 9 months ago
> Anyway, this was more of an experiment than a serious fix proposal.
> I believe the general idea is solid but it'll get a couple of spins to 
> get a
> good patch for pushing. I'll send the POC as a RFC to get things 
> started and
> we'll see what happens.

I've sent the POC as an RFC here:

"[RFC drcVI PATCH] spapr: reset DRCs on migration pre_load"

The patch mentions drcVI because it was made on top of this series.
Let me know what you think.

Daniel