[libvirt] [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.

Prerna Saxena posted 1 patch 6 years, 3 months ago
git fetch https://github.com/patchew-project/libvirt tags/patchew/1516890083-153428-1-git-send-email-saxenap.ltc@gmail.com
In case of non-p2p migration, if the libvirt client gets disconnected from the source libvirtd
after the PERFORM phase is over, the daemon simply resets the current migration job.
However, the VM could be left paused on both the source and the destination. If the
client later reconnects and queries migration status, the job has been blanked out on the
source, and the reconnected client has no clear way of figuring out whether an unclean
migration was previously aborted.

This patch calls out a "potentially" incomplete migration as a failed job, so that a client may
watch previously failed jobs for this VM and take corrective action as needed.

Signed-off-by: Prerna Saxena <saxenap.ltc@gmail.com>
---
 src/qemu/qemu_domain.c    |   16 ++++++++++++++++
 src/qemu/qemu_domain.h    |    2 ++
 src/qemu/qemu_migration.c |    4 ++--
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index e8e0313..7c60d17 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
     qemuDomainObjSaveJob(driver, obj);
 }
 
+
+void
+qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
+{
+    qemuDomainObjPrivatePtr priv = obj->privateData;
+    VIR_FREE(priv->job.completed);
+    if (VIR_ALLOC(priv->job.completed) == 0) {
+        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
+        priv->job.completed = priv->job.current;
+    } else {
+        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
+    }
+    qemuDomainObjResetAsyncJob(priv);
+    qemuDomainObjEndJob(driver, obj);
+}
+
 void
 qemuDomainObjReleaseAsyncJob(virDomainObjPtr obj)
 {
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index c33af36..6465603 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -497,6 +497,8 @@ void qemuDomainObjRestoreJob(virDomainObjPtr obj,
                              struct qemuDomainJobObj *job);
 void qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver,
                                   virDomainObjPtr obj);
+void qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver,
+                                  virDomainObjPtr obj);
 void qemuDomainObjReleaseAsyncJob(virDomainObjPtr obj);
 
 qemuMonitorPtr qemuDomainGetMonitor(virDomainObjPtr vm)
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 69eb231..fd8950e 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1911,8 +1911,8 @@ qemuMigrationCleanup(virDomainObjPtr vm,
         VIR_WARN("Migration of domain %s finished but we don't know if the"
                  " domain was successfully started on destination or not",
                  vm->def->name);
-        /* clear the job and let higher levels decide what to do */
-        qemuDomainObjDiscardAsyncJob(driver, vm);
+        /* clearly "fail" the job and let higher levels decide what to do */
+        qemuDomainObjFailAsyncJob(driver, vm);
         break;
 
     case QEMU_MIGRATION_PHASE_PERFORM3:
-- 
1.7.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.
Posted by Jiri Denemark 6 years, 3 months ago
On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> In case of non-p2p migration, if the libvirt client gets disconnected from the source libvirtd
> after the PERFORM phase is over, the daemon simply resets the current migration job.
> However, the VM could be left paused on both the source and the destination. If the
> client later reconnects and queries migration status, the job has been blanked out on the
> source, and the reconnected client has no clear way of figuring out whether an unclean
> migration was previously aborted.

The virDomainGetState API should return VIR_DOMAIN_PAUSED with
VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
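
A reconnected client can detect this state with something like the following sketch
using the public libvirt API (the domain lookup and error handling are illustrative
only, not part of the patch under review):

```c
#include <libvirt/libvirt.h>

/* Return 1 if the named domain was left paused by an interrupted
 * migration, 0 if not, -1 on lookup failure.  The domain name is
 * whatever the client was migrating. */
int
checkPausedByMigration(virConnectPtr conn, const char *name)
{
    virDomainPtr dom = virDomainLookupByName(conn, name);
    int state, reason;
    int ret = 0;

    if (!dom)
        return -1;

    if (virDomainGetState(dom, &state, &reason, 0) == 0 &&
        state == VIR_DOMAIN_PAUSED &&
        reason == VIR_DOMAIN_PAUSED_MIGRATION)
        ret = 1;    /* paused by migration, recovery needed */

    virDomainFree(dom);
    return ret;
}
```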

> This patch calls out a "potentially" incomplete migration as a failed
> job, so that a client may

As you say it's potentially incomplete, so marking it as failed is not
completely correct. It's a split brain when the source cannot
distinguish whether the migration was successful or not.

> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index e8e0313..7c60d17 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
>      qemuDomainObjSaveJob(driver, obj);
>  }
>  
> +
> +void
> +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> +{
> +    qemuDomainObjPrivatePtr priv = obj->privateData;
> +    VIR_FREE(priv->job.completed);
> +    if (VIR_ALLOC(priv->job.completed) == 0) {
> +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> +        priv->job.completed = priv->job.current;

This will just leak the memory allocated for priv->job.completed by
overwriting the pointer to the one from priv->job.current, ...

> +    } else {
> +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> +    }
> +    qemuDomainObjResetAsyncJob(priv);

which will point to a freed memory after this call.
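
One way to avoid both the leak and the dangling pointer is to copy the current job's
stats into the freshly allocated record instead of aliasing the pointer. A stand-alone
sketch of that pattern (plain malloc/free and hypothetical struct names stand in for
libvirt's VIR_ALLOC/VIR_FREE and qemuDomainJobObj):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for virDomainJobInfo / qemuDomainJobObj. */
enum { JOB_NONE = 0, JOB_FAILED = 5 };

typedef struct {
    int type;
} jobInfo;

typedef struct {
    jobInfo *current;     /* owned; freed when the async job is reset */
    jobInfo *completed;   /* owned; must never alias 'current' */
} jobObj;

/* Copy the stats of the current job into a freshly allocated
 * 'completed' record, mark it failed, and only then reset the
 * current job.  Because 'completed' owns its own allocation, the
 * reset cannot leave it dangling, and nothing is leaked. */
static int
failAsyncJob(jobObj *job)
{
    jobInfo *copy;

    if (!(copy = malloc(sizeof(*copy))))
        return -1;
    memcpy(copy, job->current, sizeof(*copy));
    copy->type = JOB_FAILED;

    free(job->completed);     /* drop any previous completed record */
    job->completed = copy;

    /* stand-in for qemuDomainObjResetAsyncJob(): frees 'current' */
    free(job->current);
    job->current = NULL;
    return 0;
}
```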

> +    qemuDomainObjEndJob(driver, obj);

And while this is almost certainly (I didn't really check though) not
something you should call from a close callback, you don't save the
changes to the status XML so once libvirtd restarts, it will think the
domain is still being migrated.

Jirka

Re: [libvirt] [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.
Posted by Prerna 6 years, 2 months ago
Hi Jirka,

On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar@redhat.com> wrote:

> On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > In case of non-p2p migration, if the libvirt client gets disconnected from the source libvirtd
> > after the PERFORM phase is over, the daemon simply resets the current migration job.
> > However, the VM could be left paused on both the source and the destination. If the
> > client later reconnects and queries migration status, the job has been blanked out on the
> > source, and the reconnected client has no clear way of figuring out whether an unclean
> > migration was previously aborted.
>
> The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
>
>
I understand that a client application should poll the source libvirtd for the status
of migration job completion using virDomainGetJobStats().
However, as you explained above, the cleanup callbacks clear the job info, so a
client would additionally need to poll virDomainGetState() too.
Would it not be cleaner to have a single API reflect the source of truth?


> > This patch calls out a "potentially" incomplete migration as a failed
> > job, so that a client may
>
> As you say it's potentially incomplete, so marking it as failed is not
> completely correct. It's a split brain when the source cannot
> distinguish whether the migration was successful or not.
>

Agree, it might have run to completion too, as we observed in some cases.
Do you think marking the job status as "UNKNOWN" would be a better articulation
of the current state?


>
> > diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> > index e8e0313..7c60d17 100644
> > --- a/src/qemu/qemu_domain.c
> > +++ b/src/qemu/qemu_domain.c
> > @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> >      qemuDomainObjSaveJob(driver, obj);
> >  }
> >
> > +
> > +void
> > +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> > +{
> > +    qemuDomainObjPrivatePtr priv = obj->privateData;
> > +    VIR_FREE(priv->job.completed);
> > +    if (VIR_ALLOC(priv->job.completed) == 0) {
> > +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> > +        priv->job.completed = priv->job.current;
>
> This will just leak the memory allocated for priv->job.completed by
> overwriting the pointer to the one from priv->job.current, ...
>
> > +    } else {
> > +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> > +    }
> > +    qemuDomainObjResetAsyncJob(priv);
>
> which will point to a freed memory after this call.
>

Agree, I will fix this.


>
> > +    qemuDomainObjEndJob(driver, obj);
>
> And while this is almost certainly (I didn't really check though) not
> something you should call from a close callback, you don't save the
> changes to the status XML so once libvirtd restarts, it will think the
> domain is still being migrated.
>

I will also save this change to the status XML.
I am suggesting that strengthening the job data would be additionally useful: if the
daemon has not restarted, the job information can still give us a fairly accurate
status of the migration. Please let me know if you think this is not useful; I would
be happy to learn the rationale.

Regards,
Prerna
Re: [libvirt] [PATCH] Migration: Preserve the failed job in case migration job is terminated beyond the perform phase.
Posted by Jiri Denemark 6 years, 2 months ago
On Mon, Jan 29, 2018 at 15:56:29 +0530, Prerna wrote:
> Hi Jirka,
> 
> On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar@redhat.com> wrote:
> 
> > On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > > In case of non-p2p migration, if the libvirt client gets disconnected from the source libvirtd
> > > after the PERFORM phase is over, the daemon simply resets the current migration job.
> > > However, the VM could be left paused on both the source and the destination. If the
> > > client later reconnects and queries migration status, the job has been blanked out on the
> > > source, and the reconnected client has no clear way of figuring out whether an unclean
> > > migration was previously aborted.
> >
> > The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> > VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
> >
> I understand that a client application should poll source libvirtd for
> status of migration job completion using virDomainGetJobStats().

Not really, it may poll if it wants to monitor migration progress, but
normally the client would just wait for the migration API to return
either success or failure.

> However, as you explained above, cleanup callbacks clear the job info so a
> client should additionally be polling for virDomainGetState() too.

Well, even if virDomainGetJobStats with VIR_DOMAIN_JOB_STATS_COMPLETED
flag was modified to report the job as VIR_DOMAIN_JOB_FAILED, the client
would still need to call virDomainGetState (on both sides in some cases)
to check whether the domain is running or it was left in a paused state.
So the reporting of a failed job by virDomainGetJobStats does not seem
to be really necessary. And it would be a bit confusing too since the
flag is called *_COMPLETED, while the migration in fact did not
complete. This confusion could be fixed by introducing a new flag,
but...

> Would it not be cleaner to have a single API reflect the source of truth?

Perhaps, but since there already is a way of getting the info, any
client which wants to work with more than just a bleeding edge libvirt
would still need to implement the existing way. And why would the client
bother using the new API when it can be sure the old way will still be
available? Doing so would make the client even more complicated for no
benefit.

But as I said, just seeing that a previous migration job failed is not
enough to recover from a disconnected client which was controlling a
non-p2p migration.

BTW, p2p migration is far less fragile in this respect. If the
connection to a client breaks, migration normally continues without any
disruption. And if the connection between libvirt daemons fails, both
sides will detect it and abort the migration. Of course, a split brain
can still happen even with p2p migration, but it's not so easy to
trigger it since the time frame in which the connection has to break to
cause a split brain is much shorter.

Jirka
