[libvirt] [PATCH v2 1/3] qemu: use a bigger unplug timeout for PPC64 guests

Daniel Henrique Barboza posted 3 patches 6 years, 4 months ago
There is a newer version of this series
[libvirt] [PATCH v2 1/3] qemu: use a bigger unplug timeout for PPC64 guests
Posted by Daniel Henrique Barboza 6 years, 4 months ago
For some architectures and setups, device removal can take
longer than the default 5 seconds. This results in commands
such as 'virsh setvcpus' to fire timeout messages even if
the operation were successful in the guest, confusing the
user.

This patch sets a new 10 seconds unplug timeout for PPC64
guests. All other archs will keep the default 5 seconds
timeout.

Instead of putting 'if PPC64' conditionals inside qemu_hotplug.c
to set the new timeout value, a new QEMU driver attribute
'unplugTimeout' was added. The timeout value is set during
qemuStateInitialize only once. All qemu_hotplug.c functions
that uses the timeout have easy access to a qemu_driver object,
thus the change to use unplugTimeout is straightforward.

The now unused 'qemuDomainRemoveDeviceWaitTime' global can't
be simply erased from qemu_hotplug.c though. Next patch will
remove it properly.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_conf.h    |  3 +++
 src/qemu/qemu_driver.c  | 11 +++++++++++
 src/qemu/qemu_hotplug.c | 11 ++++++-----
 tests/qemuhotplugtest.c |  2 +-
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index 8473d6d4ca..edda1421d0 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -296,6 +296,9 @@ struct _virQEMUDriver {
 
     /* Immutable pointer, self-locking APIs */
     virHashAtomicPtr migrationErrors;
+
+    /* Immutable value */
+    unsigned int unplugTimeout;
 };
 
 virQEMUDriverConfigPtr virQEMUDriverConfigNew(bool privileged);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 1e041a8bac..2ebe22ee79 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -134,6 +134,13 @@ VIR_LOG_INIT("qemu.qemu_driver");
 
 #define QEMU_NB_BANDWIDTH_PARAM 7
 
+/* Timeout in miliseconds for device removal. PPC64 domains
+ * can experience a bigger delay in unplug operations during
+ * heavy guest activity (vcpu being the most notable case), thus
+ * the timeout for PPC64 is also bigger. */
+#define QEMU_UNPLUG_TIMEOUT 5000
+#define QEMU_UNPLUG_TIMEOUT_PPC64 10000
+
 static void qemuProcessEventHandler(void *data, void *opaque);
 
 static int qemuStateCleanup(void);
@@ -1095,6 +1102,10 @@ qemuStateInitialize(bool privileged,
     if (!qemu_driver->workerPool)
         goto error;
 
+    qemu_driver->unplugTimeout = QEMU_UNPLUG_TIMEOUT;
+    if (ARCH_IS_PPC64(qemu_driver->caps->host.arch))
+        qemu_driver->unplugTimeout = QEMU_UNPLUG_TIMEOUT_PPC64;
+
     qemuProcessReconnectAll(qemu_driver);
 
     qemuAutostartDomains(qemu_driver);
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index bd8868b0f7..4088cc5e8e 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -5237,7 +5237,8 @@ qemuDomainResetDeviceRemoval(virDomainObjPtr vm)
  *      - we failed to reliably wait for the event and thus use fallback behavior
  */
 static int
-qemuDomainWaitForDeviceRemoval(virDomainObjPtr vm)
+qemuDomainWaitForDeviceRemoval(virQEMUDriverPtr driver,
+                               virDomainObjPtr vm)
 {
     qemuDomainObjPrivatePtr priv = vm->privateData;
     unsigned long long until;
@@ -5245,7 +5246,7 @@ qemuDomainWaitForDeviceRemoval(virDomainObjPtr vm)
 
     if (virTimeMillisNow(&until) < 0)
         return 1;
-    until += qemuDomainRemoveDeviceWaitTime;
+    until += driver->unplugTimeout;
 
     while (priv->unplug.alias) {
         if ((rc = virDomainObjWaitUntil(vm, until)) == 1)
@@ -5701,7 +5702,7 @@ qemuDomainDetachDeviceChr(virQEMUDriverPtr driver,
     } else if (async) {
         ret = 0;
     } else {
-        if ((ret = qemuDomainWaitForDeviceRemoval(vm)) == 1)
+        if ((ret = qemuDomainWaitForDeviceRemoval(driver, vm)) == 1)
             ret = qemuDomainRemoveChrDevice(driver, vm, tmpChr, true);
     }
 
@@ -6001,7 +6002,7 @@ qemuDomainDetachDeviceLive(virDomainObjPtr vm,
     if (async) {
         ret = 0;
     } else {
-        if ((ret = qemuDomainWaitForDeviceRemoval(vm)) == 1)
+        if ((ret = qemuDomainWaitForDeviceRemoval(driver, vm)) == 1)
             ret = qemuDomainRemoveDevice(driver, vm, &detach);
     }
 
@@ -6107,7 +6108,7 @@ qemuDomainHotplugDelVcpu(virQEMUDriverPtr driver,
         goto cleanup;
     }
 
-    if ((rc = qemuDomainWaitForDeviceRemoval(vm)) <= 0) {
+    if ((rc = qemuDomainWaitForDeviceRemoval(driver, vm)) <= 0) {
         if (rc == 0)
             virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                            _("vcpu unplug request timed out"));
diff --git a/tests/qemuhotplugtest.c b/tests/qemuhotplugtest.c
index 3c177c6622..da6258991d 100644
--- a/tests/qemuhotplugtest.c
+++ b/tests/qemuhotplugtest.c
@@ -645,7 +645,7 @@ mymain(void)
     driver.hostdevMgr = virHostdevManagerGetDefault();
 
     /* wait only 100ms for DEVICE_DELETED event */
-    qemuDomainRemoveDeviceWaitTime = 100;
+    driver.unplugTimeout = 100;
 
 #define DO_TEST(file, ACTION, dev, fial, kep, ...) \
     do { \
-- 
2.21.0

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2 1/3] qemu: use a bigger unplug timeout for PPC64 guests
Posted by Cole Robinson 6 years, 4 months ago
On 9/11/19 5:05 PM, Daniel Henrique Barboza wrote:
> For some architectures and setups, device removal can take
> longer than the default 5 seconds. This results in commands
> such as 'virsh setvcpus' to fire timeout messages even if
> the operation were successful in the guest, confusing the
> user.
> 
> This patch sets a new 10 seconds unplug timeout for PPC64
> guests. All other archs will keep the default 5 seconds
> timeout.
> 
> Instead of putting 'if PPC64' conditionals inside qemu_hotplug.c
> to set the new timeout value, a new QEMU driver attribute
> 'unplugTimeout' was added. The timeout value is set during
> qemuStateInitialize only once. All qemu_hotplug.c functions
> that uses the timeout have easy access to a qemu_driver object,
> thus the change to use unplugTimeout is straightforward.
> 
> The now unused 'qemuDomainRemoveDeviceWaitTime' global can't
> be simply erased from qemu_hotplug.c though. Next patch will
> remove it properly.
> 

Sorry for the wrong review delay. I see this implements danpb's 
suggestion from the previous thread. The implementation seems a little 
odd to me though because it is differentiating on host arch, but this is 
about guest arch right? And probably an arbitrary number of options, 
like I imagine TCG would want a longer timeout too (though that's not 
anything you need to deal with)

So I think this should be a function that lives in qemu_hotplug.c and 
acts on a DomainDef at least. The test suite will have to mock that in a 
qemuhotplugmock.c file, tests/qemucpumock.c is a good example to follow.

If you do that as an upfront patch, it can go in first. Then add the ppc 
changes in an add on patch.

You can CC me on the next version and I will review it

Thanks,
Cole

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2 1/3] qemu: use a bigger unplug timeout for PPC64 guests
Posted by Daniel Henrique Barboza 6 years, 3 months ago

On 10/9/19 5:15 PM, Cole Robinson wrote:
> On 9/11/19 5:05 PM, Daniel Henrique Barboza wrote:
>> For some architectures and setups, device removal can take
>> longer than the default 5 seconds. This results in commands
>> such as 'virsh setvcpus' to fire timeout messages even if
>> the operation were successful in the guest, confusing the
>> user.
>>
>> This patch sets a new 10 seconds unplug timeout for PPC64
>> guests. All other archs will keep the default 5 seconds
>> timeout.
>>
>> Instead of putting 'if PPC64' conditionals inside qemu_hotplug.c
>> to set the new timeout value, a new QEMU driver attribute
>> 'unplugTimeout' was added. The timeout value is set during
>> qemuStateInitialize only once. All qemu_hotplug.c functions
>> that uses the timeout have easy access to a qemu_driver object,
>> thus the change to use unplugTimeout is straightforward.
>>
>> The now unused 'qemuDomainRemoveDeviceWaitTime' global can't
>> be simply erased from qemu_hotplug.c though. Next patch will
>> remove it properly.
>>
>
> Sorry for the wrong review delay. I see this implements danpb's 
> suggestion from the previous thread. The implementation seems a little 
> odd to me though because it is differentiating on host arch, but this 
> is about guest arch right? And probably an arbitrary number of 
> options, like I imagine TCG would want a longer timeout too (though 
> that's not anything you need to deal with)

I think it's sensible to say that a TCG guest would always required a
greater unplug timeout than a hardware accelerated one. However, never
thought about considering TCG guests in this patch series though. A shame.

So, considering that TCG guest exists, we can parametrize this unplug 
timeout
calculation by considering guest (not host) architecture and if it's a 
TCG or
a KVM guest. Speaking about the problem I'm trying to fix with ppc64 and
what we already have, we can roll out this unplug timeout logic like:

- pseries guest on KVM: 10 seconds
- everyone else on KVM: 5 seconds (untouched, like it is today)

For TCG guests, perhaps double the KVM timeout? This would need
experimentation, but double the timeout seems ok at first glance. Then we
can do:

- TCG pseries guest: 10 * 2 = 20 seconds
- TCG with every other arch guest = 5 * 2 = 10 seconds


This TCG calculation can be left alone for now as well - I can create 
the API
considering that TCG guest exists, but do not infer the timeout for it. 
Either
way works for me.

>
> So I think this should be a function that lives in qemu_hotplug.c and 
> acts on a DomainDef at least. The test suite will have to mock that in 
> a qemuhotplugmock.c file, tests/qemucpumock.c is a good example to 
> follow.

Just to check if we're on the same page, by 'I think this should be a 
function'
you mean the calculation of qemuDomainRemoveDeviceWaitTime that I just
mentioned above, right?


>
> If you do that as an upfront patch, it can go in first. Then add the 
> ppc changes in an add on patch.

Works for me.



Thanks,


DHB

>
> You can CC me on the next version and I will review it
>
> Thanks,
> Cole

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2 1/3] qemu: use a bigger unplug timeout for PPC64 guests
Posted by Cole Robinson 6 years, 3 months ago
On 10/11/19 3:54 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 10/9/19 5:15 PM, Cole Robinson wrote:
>> On 9/11/19 5:05 PM, Daniel Henrique Barboza wrote:
>>> For some architectures and setups, device removal can take
>>> longer than the default 5 seconds. This results in commands
>>> such as 'virsh setvcpus' to fire timeout messages even if
>>> the operation were successful in the guest, confusing the
>>> user.
>>>
>>> This patch sets a new 10 seconds unplug timeout for PPC64
>>> guests. All other archs will keep the default 5 seconds
>>> timeout.
>>>
>>> Instead of putting 'if PPC64' conditionals inside qemu_hotplug.c
>>> to set the new timeout value, a new QEMU driver attribute
>>> 'unplugTimeout' was added. The timeout value is set during
>>> qemuStateInitialize only once. All qemu_hotplug.c functions
>>> that uses the timeout have easy access to a qemu_driver object,
>>> thus the change to use unplugTimeout is straightforward.
>>>
>>> The now unused 'qemuDomainRemoveDeviceWaitTime' global can't
>>> be simply erased from qemu_hotplug.c though. Next patch will
>>> remove it properly.
>>>
>>
>> Sorry for the wrong review delay. I see this implements danpb's 
>> suggestion from the previous thread. The implementation seems a little 
>> odd to me though because it is differentiating on host arch, but this 
>> is about guest arch right? And probably an arbitrary number of 
>> options, like I imagine TCG would want a longer timeout too (though 
>> that's not anything you need to deal with)
> 
> I think it's sensible to say that a TCG guest would always required a
> greater unplug timeout than a hardware accelerated one. However, never
> thought about considering TCG guests in this patch series though. A shame.
> 
> So, considering that TCG guest exists, we can parametrize this unplug 
> timeout
> calculation by considering guest (not host) architecture and if it's a 
> TCG or
> a KVM guest. Speaking about the problem I'm trying to fix with ppc64 and
> what we already have, we can roll out this unplug timeout logic like:
> 
> - pseries guest on KVM: 10 seconds
> - everyone else on KVM: 5 seconds (untouched, like it is today)
>  > For TCG guests, perhaps double the KVM timeout? This would need
> experimentation, but double the timeout seems ok at first glance. Then we
> can do:
> 
> - TCG pseries guest: 10 * 2 = 20 seconds
> - TCG with every other arch guest = 5 * 2 = 10 seconds
> 
> 
> This TCG calculation can be left alone for now as well - I can create 
> the API
> considering that TCG guest exists, but do not infer the timeout for it. 
> Either
> way works for me.
> 

I suggest just fixing the case you care about, leave TCG as it is (the 
default 5 seconds), otherwise we may be trying to fix something that no 
one cares about in real life.

But I think conceptually the function is a better fit incase the logic 
every becomes more complicated than a single host check.

>>
>> So I think this should be a function that lives in qemu_hotplug.c and 
>> acts on a DomainDef at least. The test suite will have to mock that in 
>> a qemuhotplugmock.c file, tests/qemucpumock.c is a good example to 
>> follow.
> 
> Just to check if we're on the same page, by 'I think this should be a 
> function'
> you mean the calculation of qemuDomainRemoveDeviceWaitTime that I just
> mentioned above, right?
> 

Yup that's what I meant

Thanks,
Cole

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list