[libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration

Jiri Denemark posted 80 patches 3 years, 9 months ago
Posted by Jiri Denemark 3 years, 9 months ago
There's no need to artificially pause a domain when post-copy fails. The
virtual CPUs may continue running, only the guest tasks that decide to
read a page which has not been migrated yet will get blocked.

Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
---
 src/qemu/qemu_migration.c | 37 +++++++++++++++++++++++++++++++++----
 src/qemu/qemu_migration.h |  6 ++++--
 src/qemu/qemu_process.c   |  8 ++++----
 3 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index b735bdb391..a5c7a27124 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1577,14 +1577,19 @@ qemuMigrationSrcIsSafe(virDomainDef *def,
 
 
 void
-qemuMigrationAnyPostcopyFailed(virQEMUDriver *driver,
-                               virDomainObj *vm)
+qemuMigrationSrcPostcopyFailed(virDomainObj *vm)
 {
+    qemuDomainObjPrivate *priv = vm->privateData;
+    virQEMUDriver *driver = priv->driver;
     virDomainState state;
     int reason;
 
     state = virDomainObjGetState(vm, &reason);
 
+    VIR_DEBUG("%s/%s",
+              virDomainStateTypeToString(state),
+              virDomainStateReasonToString(state, reason));
+
     if (state != VIR_DOMAIN_PAUSED &&
         state != VIR_DOMAIN_RUNNING)
         return;
@@ -1608,6 +1613,30 @@ qemuMigrationAnyPostcopyFailed(virQEMUDriver *driver,
 }
 
 
+void
+qemuMigrationDstPostcopyFailed(virDomainObj *vm)
+{
+    virDomainState state;
+    int reason;
+
+    state = virDomainObjGetState(vm, &reason);
+
+    VIR_DEBUG("%s/%s",
+              virDomainStateTypeToString(state),
+              virDomainStateReasonToString(state, reason));
+
+    if (state != VIR_DOMAIN_RUNNING ||
+        reason == VIR_DOMAIN_RUNNING_POSTCOPY_FAILED)
+        return;
+
+    VIR_WARN("Incoming migration of domain %s failed during post-copy; "
+             "leaving the domain running in a degraded mode", vm->def->name);
+
+    virDomainObjSetState(vm, VIR_DOMAIN_RUNNING,
+                         VIR_DOMAIN_RUNNING_POSTCOPY_FAILED);
+}
+
+
 static int
 qemuMigrationSrcWaitForSpice(virDomainObj *vm)
 {
@@ -3470,7 +3499,7 @@ qemuMigrationSrcConfirmPhase(virQEMUDriver *driver,
 
         if (virDomainObjGetState(vm, &reason) == VIR_DOMAIN_PAUSED &&
             reason == VIR_DOMAIN_PAUSED_POSTCOPY)
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationSrcPostcopyFailed(vm);
         else
             qemuMigrationSrcRestoreDomainState(driver, vm);
 
@@ -5847,7 +5876,7 @@ qemuMigrationDstFinish(virQEMUDriver *driver,
                                 VIR_DOMAIN_EVENT_STOPPED_FAILED);
             virObjectEventStateQueue(driver->domainEventState, event);
         } else {
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationDstPostcopyFailed(vm);
         }
     }
 
diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h
index a8afa66119..c4e4228282 100644
--- a/src/qemu/qemu_migration.h
+++ b/src/qemu/qemu_migration.h
@@ -251,8 +251,10 @@ qemuMigrationDstRun(virQEMUDriver *driver,
                     virDomainAsyncJob asyncJob);
 
 void
-qemuMigrationAnyPostcopyFailed(virQEMUDriver *driver,
-                            virDomainObj *vm);
+qemuMigrationSrcPostcopyFailed(virDomainObj *vm);
+
+void
+qemuMigrationDstPostcopyFailed(virDomainObj *vm);
 
 int
 qemuMigrationSrcFetchMirrorStats(virQEMUDriver *driver,
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 1925559fad..a3192a7196 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -3482,7 +3482,7 @@ qemuProcessRecoverMigrationIn(virQEMUDriver *driver,
          * confirm success or failure yet; killing it seems safest unless
          * we already started guest CPUs or we were in post-copy mode */
         if (postcopy) {
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationDstPostcopyFailed(vm);
         } else if (state != VIR_DOMAIN_RUNNING) {
             VIR_DEBUG("Killing migrated domain %s", vm->def->name);
             return -1;
@@ -3533,7 +3533,7 @@ qemuProcessRecoverMigrationOut(virQEMUDriver *driver,
          * post-copy mode
          */
         if (postcopy) {
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationSrcPostcopyFailed(vm);
         } else {
             VIR_DEBUG("Cancelling unfinished migration of domain %s",
                       vm->def->name);
@@ -3551,7 +3551,7 @@ qemuProcessRecoverMigrationOut(virQEMUDriver *driver,
          * post-copy mode we can use PAUSED_POSTCOPY_FAILED state for this
          */
         if (postcopy)
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationSrcPostcopyFailed(vm);
         break;
 
     case QEMU_MIGRATION_PHASE_CONFIRM3_CANCELLED:
@@ -3560,7 +3560,7 @@ qemuProcessRecoverMigrationOut(virQEMUDriver *driver,
          * as broken in that case
          */
         if (postcopy) {
-            qemuMigrationAnyPostcopyFailed(driver, vm);
+            qemuMigrationSrcPostcopyFailed(vm);
         } else {
             VIR_DEBUG("Resuming domain %s after failed migration",
                       vm->def->name);
-- 
2.35.1
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Peter Krempa 3 years, 9 months ago
On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> There's no need to artificially pause a domain when post-copy fails. The
> virtual CPUs may continue running, only the guest tasks that decide to
> read a page which has not been migrated yet will get blocked.

IMO not pausing the VM is a policy decision (same way as pausing it was
though) and should be user-configurable at migration start.

I can see that users might want to prevent a half-broken VM from
executing until it gets attention needed to fix it, even when it's safe
from a "theoretical" standpoint.

> 
> Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
> ---
>  src/qemu/qemu_migration.c | 37 +++++++++++++++++++++++++++++++++----
>  src/qemu/qemu_migration.h |  6 ++++--
>  src/qemu/qemu_process.c   |  8 ++++----
>  3 files changed, 41 insertions(+), 10 deletions(-)

The code looks okay, but I think this needs more justification if it's
to be accepted in this state.
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Daniel P. Berrangé 3 years, 9 months ago
On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > There's no need to artificially pause a domain when post-copy fails. The
> > virtual CPUs may continue running, only the guest tasks that decide to
> > read a page which has not been migrated yet will get blocked.
> 
> IMO not pausing the VM is a policy decision (same way as pausing it was
> though) and should be user-configurable at migration start.
> 
> I can see that users might want to prevent a half-broken VM from
> executing until it gets attention needed to fix it, even when it's safe
> from a "theoretical" standpoint.

It isn't even safe from a theoretical standpoint though.

Consider 2 processes in a guest that are communicating with each
other. One gets blocked on a page read due to broken post-copy, but
we leave the guest running.  The other process sees no progress
from the blocked process and/or hits a timeout and throws an
error. As a result the guest application workload ends up
completely dead, even if we later recover the postcopy
migration.

Migration needs to strive to be transparent to the guest workload
and IMHO having the guest workload selectively executing and
selectively blocked is not transparent enough. It should require
a user opt-in for this kind of behaviour.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Peter Krempa 3 years, 9 months ago
On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > There's no need to artificially pause a domain when post-copy fails. The
> > > virtual CPUs may continue running, only the guest tasks that decide to
> > > read a page which has not been migrated yet will get blocked.
> > 
> > IMO not pausing the VM is a policy decision (same way as pausing it was
> > though) and should be user-configurable at migration start.
> > 
> > I can see that users might want to prevent a half-broken VM from
> > executing until it gets attention needed to fix it, even when it's safe
> > from a "theoretical" standpoint.
> 
> It isn't even safe from a theoretical standpoint though.
> 
> Consider 2 processes in a guest that are communicating with each
> other. One gets blocked on a page read due to broken post-copy, but
> we leave the guest running.  The other process sees no progress
> from the blocked process and/or hits a timeout and throws an
> error. As a result the guest application workload ends up
> completely dead, even if we later recover the postcopy
> migration.

IMO you have to deal with this scenario in a reduced scope anyways when
opting into using post-copy.

Each page transfer is vastly slower than a comparable access into
memory, so if the 'timeout' portion is implied to be on the same order
of magnitude as memory access latency then your software is going to have
a very bad time when being migrated in post-copy mode. If the link gets
congested ... then it's even worse.

Obviously when the migration link breaks and you are getting an unbounded
wait for a page access, it's worse even for other types of apps.

Anyway, does qemu support pausing the destination if the connection
breaks? If not, the second best thing we can do is to pause it from
libvirt, but that will still have caveats, e.g. when libvirt is not
around to pause it.
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Daniel P. Berrangé 3 years, 9 months ago
On Wed, May 11, 2022 at 01:03:43PM +0200, Peter Krempa wrote:
> On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > read a page which has not been migrated yet will get blocked.
> > > 
> > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > though) and should be user-configurable at migration start.
> > > 
> > > I can see that users might want to prevent a half-broken VM from
> > > executing until it gets attention needed to fix it, even when it's safe
> > > from a "theoretical" standpoint.
> > 
> > It isn't even safe from a theoretical standpoint though.
> > 
> > Consider 2 processes in a guest that are communicating with each
> > other. One gets blocked on a page read due to broken post-copy, but
> > we leave the guest running.  The other process sees no progress
> > from the blocked process and/or hits a timeout and throws an
> > error. As a result the guest application workload ends up
> > completely dead, even if we later recover the postcopy
> > migration.
> 
> IMO you have to deal with this scenario in a reduced scope anyways when
> opting into using post-copy.
> 
> Each page transfer is vastly slower than a comparable access into
> memory, so if the 'timeout' portion is implied to be on the same order
> of magnitude as memory access latency then your software is going to have
> a very bad time when being migrated in post-copy mode. If the link gets
> congested ... then it's even worse.

Those are likely very different orders of magnitude though. A "slow"
page access in post-copy is $LOW seconds. A blocked process due to
a broken post-copy connection is potentially $HIGH minutes long if
the infra takes a long time to fix.

A page access taking seconds rather than microseconds isn't
going to trip up many app level timeouts IMHO.

A process blocked for many minutes is highly likely to trigger
app level timeouts.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Jiri Denemark 3 years, 9 months ago
On Wed, May 11, 2022 at 10:48:10 +0200, Peter Krempa wrote:
> On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > There's no need to artificially pause a domain when post-copy fails. The
> > virtual CPUs may continue running, only the guest tasks that decide to
> > read a page which has not been migrated yet will get blocked.
> 
> IMO not pausing the VM is a policy decision (same way as pausing it was
> though) and should be user-configurable at migration start.
> 
> I can see that users might want to prevent a half-broken VM from
> executing until it gets attention needed to fix it, even when it's safe
> from a "theoretical" standpoint.

It depends how much was already migrated. In practice the guest may
easily stop running anyway :-) So yeah, it was a needless policy
decision which is being removed now. But the important reason behind it,
which I should have mentioned in the commit message, is the difference
between libvirt and QEMU migration state. When libvirt connection breaks
(between daemons for p2p migration or between a client and daemons) we
consider migration as broken from the API point of view and return
failure. However, the migration may still be running just fine if the
connection between QEMU processes remains working. And since we're in
post-copy phase, the migration can even finish just fine without
libvirt. So a half-broken VM may magically become a fully working
migrated VM after our migration API reported a failure. Keeping the
domain running makes this situation easier to handle :-)

Jirka
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Peter Krempa 3 years, 9 months ago
On Wed, May 11, 2022 at 12:26:52 +0200, Jiri Denemark wrote:
> On Wed, May 11, 2022 at 10:48:10 +0200, Peter Krempa wrote:
> > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > There's no need to artificially pause a domain when post-copy fails. The
> > > virtual CPUs may continue running, only the guest tasks that decide to
> > > read a page which has not been migrated yet will get blocked.
> > 
> > IMO not pausing the VM is a policy decision (same way as pausing it was
> > though) and should be user-configurable at migration start.
> > 
> > I can see that users might want to prevent a half-broken VM from
> > executing until it gets attention needed to fix it, even when it's safe
> > from a "theoretical" standpoint.
> 
> It depends how much was already migrated. In practice the guest may
> easily stop running anyway :-) 

Well, I'd consider that behaviour to be very bad actually, but given the
caveats below ...

> So yeah, it was a needless policy
> decision which is being removed now. But the important reason behind it,
> which I should have mentioned in the commit message, is the difference
> between libvirt and QEMU migration state. When libvirt connection breaks
> (between daemons for p2p migration or between a client and daemons) we
> consider migration as broken from the API point of view and return
> failure. However, the migration may still be running just fine if the
> connection between QEMU processes remains working. And since we're in
> post-copy phase, the migration can even finish just fine without
> libvirt. So a half-broken VM may magically become a fully working
> migrated VM after our migration API reported a failure. Keeping the
> domain running makes this situation easier to handle :-)

I see. Additionally if e.g. libvirtd isn't running at all (but that ties
to the "connection broken" scenario) we wouldn't even pause it.

So the caveats were in fact always there, albeit less probable.

IMO it should be documented somewhere. I agree that this patch is okay
though. I think we can add the docs in a follow-up.

Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Daniel P. Berrangé 3 years, 9 months ago
On Wed, May 11, 2022 at 12:42:08PM +0200, Peter Krempa wrote:
> On Wed, May 11, 2022 at 12:26:52 +0200, Jiri Denemark wrote:
> > On Wed, May 11, 2022 at 10:48:10 +0200, Peter Krempa wrote:
> > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > read a page which has not been migrated yet will get blocked.
> > > 
> > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > though) and should be user-configurable at migration start.
> > > 
> > > I can see that users might want to prevent a half-broken VM from
> > > executing until it gets attention needed to fix it, even when it's safe
> > > from a "theoretical" standpoint.
> > 
> > It depends how much was already migrated. In practice the guest may
> > easily stop running anyway :-) 
> 
> Well, I'd consider that behaviour to be very bad actually, but given the
> caveats below ...
> 
> > So yeah, it was a needless policy
> > decision which is being removed now. But the important reason behind it,
> > which I should have mentioned in the commit message, is the difference
> > between libvirt and QEMU migration state. When libvirt connection breaks
> > (between daemons for p2p migration or between a client and daemons) we
> > consider migration as broken from the API point of view and return
> > failure. However, the migration may still be running just fine if the
> > connection between QEMU processes remains working. And since we're in
> > post-copy phase, the migration can even finish just fine without
> > libvirt. So a half-broken VM may magically become a fully working
> > migrated VM after our migration API reported a failure. Keeping the
> > domain running makes this situation easier to handle :-)
> 
> I see. Additionally if e.g. libvirtd isn't running at all (but that ties
> to the "connection broken" scenario) we wouldn't even pause it.
> 
> So the caveats were there in fact always albeit less probable.

There are two very different scenarios here though.

Strictly from the QEMU scenario, migration only fails if there's
a problem with QEMU's migration connection. This scenario can
impact the guest, because processes get selectively blocked,
which can ultimately lead to application timeouts and errors.
If there's a failure at the QEMU level we want the guest to be
paused, so that interaction between apps in the guest is not
impacted.

On the libvirt side, if our own libvirtd connection fails,
this does not impact the guest or QEMU's migration connection
(I'll ignore tunnelled mig here). So there is no problem with
the guest continuing to execute, and even complete the migration
from QEMU POV.

For robustness we want a way for QEMU to autonomously pause the
guest when post-copy fails, so that this pausing happens even
if libvirt's connection has also failed concurrently.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration
Posted by Jiri Denemark 3 years, 9 months ago
On Wed, May 11, 2022 at 11:54:24 +0100, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 12:42:08PM +0200, Peter Krempa wrote:
> > On Wed, May 11, 2022 at 12:26:52 +0200, Jiri Denemark wrote:
> > > On Wed, May 11, 2022 at 10:48:10 +0200, Peter Krempa wrote:
> > > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > > read a page which has not been migrated yet will get blocked.
> > > > 
> > > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > > though) and should be user-configurable at migration start.
> > > > 
> > > > I can see that users might want to prevent a half-broken VM from
> > > > executing until it gets attention needed to fix it, even when it's safe
> > > > from a "theoretical" standpoint.
> > > 
> > > It depends how much was already migrated. In practice the guest may
> > > easily stop running anyway :-) 
> > 
> > Well, I'd consider that behaviour to be very bad actually, but given the
> > caveats below ...
> > 
> > > So yeah, it was a needless policy
> > > decision which is being removed now. But the important reason behind it,
> > > which I should have mentioned in the commit message, is the difference
> > > between libvirt and QEMU migration state. When libvirt connection breaks
> > > (between daemons for p2p migration or between a client and daemons) we
> > > consider migration as broken from the API point of view and return
> > > failure. However, the migration may still be running just fine if the
> > > connection between QEMU processes remains working. And since we're in
> > > post-copy phase, the migration can even finish just fine without
> > > libvirt. So a half-broken VM may magically become a fully working
> > > migrated VM after our migration API reported a failure. Keeping the
> > > domain running makes this situation easier to handle :-)
> > 
> > I see. Additionally if e.g. libvirtd isn't running at all (but that ties
> > to the "connection broken" scenario) we wouldn't even pause it.
> > 
> > So the caveats were there in fact always albeit less probable.
> 
> There are two very different scenarios here though.
> 
> Strictly from the QEMU scenario, migration only fails if there's
> a problem with QEMU's migration connection. This scenario can
> impact the guest, because processes get selectively blocked,
> which can ultimately lead to application timeouts and errors.
> If there's a failure at the QEMU level we want the guest to be
> paused, so that interaction between apps in the guest is not
> impacted.

Well, the same applies to external communication with a paused domain.
But I agree the timeouts between processes running inside a single
domain are more serious, as this is not really expected in contrast to
external communication where timeouts and broken connections are pretty
common.

> On the libvirt side, if our own libvirtd connection fails,
> this does not impact the guest or QEMU's migration connection
> (I'll ignore tunnelled mig here).

Luckily we don't even support post-copy with tunnelled migration :-)

> For robustness we want a way for QEMU to autonomously pause the
> guest when post-copy fails, so that this pausing happens even
> if libvirt's connection has also failed concurrently.

Yeah, if QEMU gets modified to stop vCPUs in the postcopy-pause migration
state (and resume them in postcopy-recover), we can keep this code to
keep the domain running. Otherwise we'd need to check the state of QEMU
migration and decide whether to pause the domain or keep it running, and
pause it in the postcopy-pause handler. But we could get into trouble if
QEMU gets to this state while libvirtd is not running. So modifying QEMU
would be a more robust solution.

Jirka