[libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN

Jim Fehlig posted 1 patch 5 years, 5 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20181102223402.13931-1-jfehlig@suse.com
Test syntax-check passed
include/libvirt/libvirt-domain.h | 12 ++++++++++++
src/qemu/qemu_migration.h        |  3 ++-
tools/virsh-domain.c             |  7 +++++++
tools/virsh.pod                  | 10 +++++++++-
4 files changed, 30 insertions(+), 2 deletions(-)
Posted by Jim Fehlig 5 years, 5 months ago
A dry run can be used as a best-effort check that a migration command
will succeed. The destination host will be checked to see if it can
accommodate the resources required by the domain. DRY_RUN will fail if
the destination host is not capable of running the domain. Although a
subsequent migration will likely succeed, the success of DRY_RUN does not
ensure a future migration will succeed. Resources on the destination host
could become unavailable between a DRY_RUN and actual migration.

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
---

If it is agreed this is useful, my thought was to use the begin and
prepare phases of migration to implement it. qemuMigrationDstPrepareAny()
already does a lot of the heavy lifting wrt checking the host can
accommodate the domain. Some of it, and the remaining migration phases,
can be short-circuited in the case of dry run.

One interesting wrinkle I've observed is the check for cpu compatibility.
AFAICT qemu is actually invoked on the dst, "filtered-features" of the cpu
are requested via qmp, and results are checked against cpu in domain config.
If cpu on dst is insufficient, migration fails in the prepare phase with
something like "guest CPU doesn't match specification: missing features: z y z".
I was hoping to avoid launching qemu in the case of dry run, but that may
be unavoidable if we'd like a dependable dry run result.

Thanks for considering the idea!

(BTW, if it is considered useful I will follow up with a V1 series that
includes this patch and an impl for the qemu driver.)

 include/libvirt/libvirt-domain.h | 12 ++++++++++++
 src/qemu/qemu_migration.h        |  3 ++-
 tools/virsh-domain.c             |  7 +++++++
 tools/virsh.pod                  | 10 +++++++++-
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index fdd2d6b8ea..6d52f6ce50 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -830,6 +830,18 @@ typedef enum {
      */
     VIR_MIGRATE_TLS               = (1 << 16),
 
+    /* Setting the VIR_MIGRATE_DRY_RUN flag will cause libvirt to make a
+     * best-effort attempt to check if migration will succeed. The destination
+     * host will be checked to see if it can accommodate the resources required
+     * by the domain. For example: are the network, disk, memory, and CPU
+     * resources used by the domain on the source host also available on the
+     * destination host? The dry run will fail if libvirt determines the
+     * destination host is not capable of running the domain. Although a
+     * subsequent migration will likely succeed, the success of dry run does
+     * not ensure a future migration will succeed. Resources on the destination
+     * host could become unavailable between a dry run and actual migration.
+     */
+    VIR_MIGRATE_DRY_RUN           = (1 << 17),
 } virDomainMigrateFlags;
 
 
diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h
index e12b6972db..b0e2bc689b 100644
--- a/src/qemu/qemu_migration.h
+++ b/src/qemu/qemu_migration.h
@@ -57,7 +57,8 @@
      VIR_MIGRATE_AUTO_CONVERGE | \
      VIR_MIGRATE_RDMA_PIN_ALL | \
      VIR_MIGRATE_POSTCOPY | \
-     VIR_MIGRATE_TLS)
+     VIR_MIGRATE_TLS | \
+     VIR_MIGRATE_DRY_RUN)
 
 /* All supported migration parameters and their types. */
 # define QEMU_MIGRATION_PARAMETERS \
diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c
index 372bdb95d3..46f0f44917 100644
--- a/tools/virsh-domain.c
+++ b/tools/virsh-domain.c
@@ -10450,6 +10450,10 @@ static const vshCmdOptDef opts_migrate[] = {
      .type = VSH_OT_BOOL,
      .help = N_("use TLS for migration")
     },
+    {.name = "dry-run",
+     .type = VSH_OT_BOOL,
+     .help = N_("check if migration will succeed without actually performing the migration")
+    },
     {.name = NULL}
 };
 
@@ -10694,6 +10698,9 @@ doMigrate(void *opaque)
     if (vshCommandOptBool(cmd, "tls"))
         flags |= VIR_MIGRATE_TLS;
 
+    if (vshCommandOptBool(cmd, "dry-run"))
+        flags |= VIR_MIGRATE_DRY_RUN;
+
     if (flags & VIR_MIGRATE_PEER2PEER || vshCommandOptBool(cmd, "direct")) {
         if (virDomainMigrateToURI3(dom, desturi, params, nparams, flags) == 0)
             ret = '0';
diff --git a/tools/virsh.pod b/tools/virsh.pod
index 86c041d575..715fa3887f 100644
--- a/tools/virsh.pod
+++ b/tools/virsh.pod
@@ -1845,7 +1845,7 @@ I<domain> I<desturi> [I<migrateuri>] [I<graphicsuri>] [I<listen-address>] [I<dna
 [I<--compressed>] [I<--comp-methods> B<method-list>]
 [I<--comp-mt-level>] [I<--comp-mt-threads>] [I<--comp-mt-dthreads>]
 [I<--comp-xbzrle-cache>] [I<--auto-converge>] [I<auto-converge-initial>]
-[I<auto-converge-increment>] [I<--persistent-xml> B<file>] [I<--tls>]
+[I<auto-converge-increment>] [I<--persistent-xml> B<file>] [I<--tls>] [I<--dry-run>]
 
 Migrate domain to another host.  Add I<--live> for live migration; <--p2p>
 for peer-2-peer migration; I<--direct> for direct migration; or I<--tunnelled>
@@ -1937,6 +1937,14 @@ Providing I<--tls> causes the migration to use the host configured TLS setup
 the migration of the domain. Usage requires proper TLS setup for both source
 and target.
 
+I<--dry-run> can be used as a best-effort check that the migration command
+will succeed. The destination host will be checked to see if it can
+accommodate the resources required by the domain. I<--dry-run> will fail if
+the destination host is not capable of running the domain. Although a
+subsequent migration will likely succeed, the success of dry run does not
+ensure a future migration will succeed. Resources on the destination host
+could become unavailable between a dry run and actual migration.
+
 Running migration can be canceled by interrupting virsh (usually using
 C<Ctrl-C>) or by B<domjobabort> command sent from another virsh instance.
 
-- 
2.18.0

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Daniel P. Berrangé 5 years, 5 months ago
On Fri, Nov 02, 2018 at 04:34:02PM -0600, Jim Fehlig wrote:
> A dry run can be used as a best-effort check that a migration command
> will succeed. The destination host will be checked to see if it can
> accommodate the resources required by the domain. DRY_RUN will fail if
> the destination host is not capable of running the domain. Although a
> subsequent migration will likely succeed, the success of DRY_RUN does not
> ensure a future migration will succeed. Resources on the destination host
> could become unavailable between a DRY_RUN and actual migration.

I'm not really convinced this is a particularly useful concept,
as it is only going to catch a very small number of the reasons
why migration can fail. So you still have to expect the real
migration invocation to have a strong chance of failing.


> 
> Signed-off-by: Jim Fehlig <jfehlig@suse.com>
> ---
> 
> If it is agreed this is useful, my thought was to use the begin and
> prepare phases of migration to implement it. qemuMigrationDstPrepareAny()
> already does a lot of the heavy lifting wrt checking the host can
> accommodate the domain. Some of it, and the remaining migration phases,
> can be short-circuited in the case of dry run.
> 
> One interesting wrinkle I've observed is the check for cpu compatibility.
> AFAICT qemu is actually invoked on the dst, "filtered-features" of the cpu
> are requested via qmp, and results are checked against cpu in domain config.
> If cpu on dst is insufficient, migration fails in the prepare phase with
> something like "guest CPU doesn't match specification: missing features: z y z".
> I was hoping to avoid launching qemu in the case of dry run, but that may
> be unavoidable if we'd like a dependable dry run result.

Even launching QEMU isn't good enough - it has to actually process the
migration data stream for devices to get a good indication of success,
at which point you're basically doing a real migration.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Jim Fehlig 5 years, 5 months ago
On 11/12/18 4:26 AM, Daniel P. Berrangé wrote:
> On Fri, Nov 02, 2018 at 04:34:02PM -0600, Jim Fehlig wrote:
>> A dry run can be used as a best-effort check that a migration command
>> will succeed. The destination host will be checked to see if it can
>> accommodate the resources required by the domain. DRY_RUN will fail if
>> the destination host is not capable of running the domain. Although a
>> subsequent migration will likely succeed, the success of DRY_RUN does not
>> ensure a future migration will succeed. Resources on the destination host
>> could become unavailable between a DRY_RUN and actual migration.
> 
> I'm not really convinced this is a particularly useful concept,
> as it is only going to catch a very small number of the reasons
> why migration can fail. So you still have to expect the real
> migration invocation to have a strong chance of failing.

I agree it is difficult to reliably check that a migration will succeed. TBH, I 
was expecting opposition due to libvirt already providing info for applications 
to do the check themselves. E.g. as nova has done with 
check_can_live_migrate_{source,destination} APIs.

Do you think libvirt provides enough information for an app to determine if a VM 
can be migrated between two hosts? Or maybe better asked: What info is currently 
missing for an app to reliably check if a VM can be migrated between two hosts?

>>
>> Signed-off-by: Jim Fehlig <jfehlig@suse.com>
>> ---
>>
>> If it is agreed this is useful, my thought was to use the begin and
>> prepare phases of migration to implement it. qemuMigrationDstPrepareAny()
>> already does a lot of the heavy lifting wrt checking the host can
>> accommodate the domain. Some of it, and the remaining migration phases,
>> can be short-circuited in the case of dry run.
>>
>> One interesting wrinkle I've observed is the check for cpu compatibility.
>> AFAICT qemu is actually invoked on the dst, "filtered-features" of the cpu
>> are requested via qmp, and results are checked against cpu in domain config.
>> If cpu on dst is insufficient, migration fails in the prepare phase with
>> something like "guest CPU doesn't match specification: missing features: z y z".
>> I was hoping to avoid launching qemu in the case of dry run, but that may
>> be unavoidable if we'd like a dependable dry run result.
> 
> Even launching QEMU isn't good enough - it has to actually process the
> migration data stream for devices to get a good indication of success,
> at which point you're basically doing a real migration.

Bummer. I guess that answers my question above: no. It also implies apps cannot 
reliably check if a migration will succeed and should instead put effort into 
handling errors from an actual migration :-).

Regards,
Jim

Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Daniel P. Berrangé 5 years, 5 months ago
On Mon, Nov 12, 2018 at 11:33:04AM -0700, Jim Fehlig wrote:
> On 11/12/18 4:26 AM, Daniel P. Berrangé wrote:
> > On Fri, Nov 02, 2018 at 04:34:02PM -0600, Jim Fehlig wrote:
> > > A dry run can be used as a best-effort check that a migration command
> > > will succeed. The destination host will be checked to see if it can
> > > accommodate the resources required by the domain. DRY_RUN will fail if
> > > the destination host is not capable of running the domain. Although a
> > > subsequent migration will likely succeed, the success of DRY_RUN does not
> > > ensure a future migration will succeed. Resources on the destination host
> > > could become unavailable between a DRY_RUN and actual migration.
> > 
> > I'm not really convinced this is a particularly useful concept,
> > as it is only going to catch a very small number of the reasons
> > why migration can fail. So you still have to expect the real
> > migration invocation to have a strong chance of failing.
> 
> I agree it is difficult to reliably check that a migration will succeed.
> TBH, I was expecting opposition due to libvirt already providing info for
> applications to do the check themselves. E.g. as nova has done with
> check_can_live_migrate_{source,destination} APIs.
> 
> Do you think libvirt provides enough information for an app to determine if
> a VM can be migrated between two hosts? Or maybe better asked: What info is
> currently missing for an app to reliably check if a VM can be migrated
> between two hosts?

There's probably two classes of problem here

 - Things that would prevent the QEMU process being started.
 
   * XML points to host resources that don't exist (block devices,
     files, nics, host devs, etc, NUMA/CPU pinning)

   * Use of QEMU features that aren't supported by this QEMU version

   * Insufficient free resources. Principally lack of RAM,
     both normal and huge pages.

   These problems are not really anything to do with live migration
   as they impact normal guest startup to exactly the same degree.

   Libvirt will already report on the first two problems during
   its normal QEMU setup process. During live migration you'll
   see these problems reported quite quickly in the prepare phase
   before any data is sent.

   Insufficient resources is really hard to report on with any
   useful accuracy. We can't even predict reliably how much RAM
   any given QEMU config will need, let alone measure whether
   the host is able to provide that much. If you're lucky QEMU
   may simply fail to start due to insufficient RAM/huge pages.
   This would abort the live migration early on before much data
   is sent.


 - Things that interfere with the live migration operation

    * Firewall blocks libvirtd <-> libvirtd comms

    * Firewall blocks QEMU <-> QEMU comms

    * Storage copy is not requested and disks are not
      on shared storage

    * Network connectivity won't seamlessly switch for
      guest NICs

    * Bugs in QEMU when loading device state causing
      failure

    * Bugs in libvirt not correctly configuring QEMU
      to ensure stable ABI

    * Live migration never converging

   Some of these get seen quite quickly such as firewall
   issues. Bugs in device state are only seen during the
   main data transfer. Problems with storage/network
   setup are only seen when the guest crashes & burns
   after migration is complete & are hard to diagnose
   earlier from libvirt's POV. Apps like nova can
   diagnose this kind of thing better as they have a
   higher level view of the storage/network connectivity
   that libvirt can't see.

   Live migration convergence is the real hard one
   that causes a lot of pain for people. Personally I
   recommend that people use post-copy by default
   to guarantee convergence in finite time, with
   low impact on guest performance. There was an
   interesting presentation at KVM Forum this year
   about doing workload prediction for VMs to identify
   which time/day has a workload that is most friendly
   towards convergence.


> > Even launching QEMU isn't good enough - it has to actually process the
> > migration data stream for devices to get a good indication of success,
> > at which point you're basically doing a real migration.
> 
> Bummer. I guess that answers my question above: no. It also implies apps
> cannot reliably check if a migration will succeed and should instead put
> effort into handling errors from an actual migration :-).

Yep, we pretty much have to accept that live migration is going to fail
and work to ensure that when it fails, you don't lose the original
VM. 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Jim Fehlig 5 years, 5 months ago
On 11/13/18 3:29 AM, Daniel P. Berrangé wrote:
> On Mon, Nov 12, 2018 at 11:33:04AM -0700, Jim Fehlig wrote:
>> On 11/12/18 4:26 AM, Daniel P. Berrangé wrote:
>>> On Fri, Nov 02, 2018 at 04:34:02PM -0600, Jim Fehlig wrote:
>>>> A dry run can be used as a best-effort check that a migration command
>>>> will succeed. The destination host will be checked to see if it can
>>>> accommodate the resources required by the domain. DRY_RUN will fail if
>>>> the destination host is not capable of running the domain. Although a
>>>> subsequent migration will likely succeed, the success of DRY_RUN does not
>>>> ensure a future migration will succeed. Resources on the destination host
>>>> could become unavailable between a DRY_RUN and actual migration.
>>>
>>> I'm not really convinced this is a particularly useful concept,
>>> as it is only going to catch a very small number of the reasons
>>> why migration can fail. So you still have to expect the real
>>> migration invocation to have a strong chance of failing.
>>
>> I agree it is difficult to reliably check that a migration will succeed.
>> TBH, I was expecting opposition due to libvirt already providing info for
>> applications to do the check themselves. E.g. as nova has done with
>> check_can_live_migrate_{source,destination} APIs.
>>
>> Do you think libvirt provides enough information for an app to determine if
>> a VM can be migrated between two hosts? Or maybe better asked: What info is
>> currently missing for an app to reliably check if a VM can be migrated
>> between two hosts?
> 
> There's probably two classes of problem here
> 
>   - Things that would prevent the QEMU process being started.
>   
>     * XML points to host resources that don't exist (block devices,
>       files, nics, host devs, etc, NUMA/CPU pinning)
> 
>     * Use of QEMU features that aren't supported by this QEMU version
> 
>     * Insufficient free resources. Principally lack of RAM,
>       both normal and huge pages.
> 
>     These problems are not really anything to do with live migration
>     as they impact normal guest startup to exactly the same degree.
> 
>     Libvirt will already report on the first two problems during
>     its normal QEMU setup process. During live migration you'll
>     see these problems reported quite quickly in the prepare phase
>     before any data is sent.

Right. These are the ones that would be easy to detect with dry run, which I 
envisioned would terminate after the prepare phase.

>     Insufficient resources is really hard to report on with any
>     useful accuracy. We can't even predict reliably how much RAM
>     any given QEMU config will need, let alone measure whether
>     the host is able to provide that much. If you're lucky QEMU
>     may simply fail to start due to insufficient RAM/huge pages.
>     This would abort the live migration early on before much data
>     is sent.

Inability to predict qemu memory overhead is indeed unfortunate. E.g. SEV 
encrypted VMs must (at the moment) have all their memory regions locked: guest 
RAM, ROM(s), pflash, video RAM, and any qemu overhead. The last one is an 
"undecidable problem" (from libvirt docs) and makes it difficult to calculate a 
suitable value for /domain/memtune/hard_limit. If the value is too small the VM 
will fail to start.

nova also has a 'reserved_host_memory_mb' setting which should include the qemu 
overhead IMO. But the docs have no guidance on how to set that, likely because 
there is no known way to reliably calculate the overhead.

>   - Things that interfere with the live migration operation
> 
>      * Firewall blocks libvirtd <-> libvirtd comms
> 
>      * Firewall blocks QEMU <-> QEMU comms
> 
>      * Storage copy is not requested and disks are not
>        on shared storage

I think these could be successfully checked in dry run too.

> 
>      * Network connectivity won't seamlessly switch for
>        guest NICs
> 
>      * Bugs in QEMU when loading device state causing
>        failure
> 
>      * Bugs in libvirt not correctly configuring QEMU
>        to ensure stable ABI
> 
>      * Live migration never converging

I've no illusions that these can be checked in dry run :-).

> 
>     Some of these get seen quite quickly such as firewall
>     issues. Bugs in device state are only seen during the
>     main data transfer. Problems with storage/network
>     setup are only seen when the guest crashes & burns
>     after migration is complete & are hard to diagnose
>     earlier from libvirt's POV. Apps like nova can
>     diagnose this kind of thing better as they have a
>     higher level view of the storage/network connectivity
>     that libvirt can't see.
> 
>     Live migration convergence is the real hard one
>     that causes a lot of pain for people. Personally I
>     recommend that people use post-copy by default
>     to guarantee convergence in finite time, with
>     low impact on guest performance. There was an
>     interesting presentation at KVM Forum this year
>     about doing workload prediction for VMs to identify
>     which time/day has a workload that is most friendly
>     towards convergence.

Ah, interesting. I've watched some of the videos as they become available in the 
youtube channel and will look for this one.

>>> Even launching QEMU isn't good enough - it has to actually process the
>>> migration data stream for devices to get a good indication of success,
>>> at which point you're basically doing a real migration.
>>
>> Bummer. I guess that answers my question above: no. It also implies apps
>> cannot reliably check if a migration will succeed and should instead put
>> effort into handling errors from an actual migration :-).
> 
> Yep, we pretty much have to accept that live migration is going to fail
> and work to ensure that when it fails, you don't lose the original
> VM.

Thanks for your detailed response, much appreciated.

Regards,
Jim

Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Michal Privoznik 5 years, 5 months ago
On 11/02/2018 11:34 PM, Jim Fehlig wrote:
> A dry run can be used as a best-effort check that a migration command
> will succeed. The destination host will be checked to see if it can
> accommodate the resources required by the domain. DRY_RUN will fail if
> the destination host is not capable of running the domain. Although a
> subsequent migration will likely succeed, the success of DRY_RUN does not
> ensure a future migration will succeed. Resources on the destination host
> could become unavailable between a DRY_RUN and actual migration.
> 
> Signed-off-by: Jim Fehlig <jfehlig@suse.com>
> ---
> 
> If it is agreed this is useful, my thought was to use the begin and
> prepare phases of migration to implement it. qemuMigrationDstPrepareAny()
> already does a lot of the heavy lifting wrt checking the host can
> accommodate the domain. Some of it, and the remaining migration phases,
> can be short-circuited in the case of dry run.
> 
> One interesting wrinkle I've observed is the check for cpu compatibility.
> AFAICT qemu is actually invoked on the dst, "filtered-features" of the cpu
> are requested via qmp, and results are checked against cpu in domain config.
> If cpu on dst is insufficient, migration fails in the prepare phase with
> something like "guest CPU doesn't match specification: missing features: z y z".
> I was hoping to avoid launching qemu in the case of dry run, but that may
> be unavoidable if we'd like a dependable dry run result.
> 
> Thanks for considering the idea!
> 
> (BTW, if it is considered useful I will follow up with a V1 series that
> includes this patch and an impl for the qemu driver.)
> 
>  include/libvirt/libvirt-domain.h | 12 ++++++++++++
>  src/qemu/qemu_migration.h        |  3 ++-
>  tools/virsh-domain.c             |  7 +++++++
>  tools/virsh.pod                  | 10 +++++++++-
>  4 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
> index fdd2d6b8ea..6d52f6ce50 100644
> --- a/include/libvirt/libvirt-domain.h
> +++ b/include/libvirt/libvirt-domain.h
> @@ -830,6 +830,18 @@ typedef enum {
>       */
>      VIR_MIGRATE_TLS               = (1 << 16),
>  
> +    /* Setting the VIR_MIGRATE_DRY_RUN flag will cause libvirt to make a
> +     * best-effort attempt to check if migration will succeed. The destination
> +     * host will be checked to see if it can accommodate the resources required
> +     * by the domain. For example: are the network, disk, memory, and CPU

While this is an honourable goal to achieve I don't think we can
guarantee it (without running qemu). At least in the qemu world. For
instance, libvirt doesn't check if there's enough memory (neither regular
nor hugepages) when a domain is started/migrated. We just run qemu and let
it fail. However, for network, CPU and hostdev we do run checks so these
might work. Disks are in a grey area - we check their presence but not
their labels. And if a disk is relabel=no then the only way to learn if
qemu would succeed is to run it.

But I don't see much problem with starting qemu in paused state. I mean,
we can get through Prepare phase but never actually reach Perform stage.
The API/flag would return success if Prepare succeeded.

I bet it's easier to check if migration would succeed in the xen world, no?

The other thing is how are apps expected to use this? I mean, if an app
wants to work without admin intervention then it would need to learn how
to fix any possible error (missing disk, perms issue, missing hostdev,
etc.). This is not a trivial task IMO.

Michal

Re: [libvirt] [RFC PATCH] Add new migration flag VIR_MIGRATE_DRY_RUN
Posted by Jim Fehlig 5 years, 5 months ago
On 11/5/18 1:46 AM, Michal Privoznik wrote:
> On 11/02/2018 11:34 PM, Jim Fehlig wrote:
>> A dry run can be used as a best-effort check that a migration command
>> will succeed. The destination host will be checked to see if it can
>> accommodate the resources required by the domain. DRY_RUN will fail if
>> the destination host is not capable of running the domain. Although a
>> subsequent migration will likely succeed, the success of DRY_RUN does not
>> ensure a future migration will succeed. Resources on the destination host
>> could become unavailable between a DRY_RUN and actual migration.
>>
>> Signed-off-by: Jim Fehlig <jfehlig@suse.com>
>> ---
>>
>> If it is agreed this is useful, my thought was to use the begin and
>> prepare phases of migration to implement it. qemuMigrationDstPrepareAny()
>> already does a lot of the heavy lifting wrt checking the host can
>> accommodate the domain. Some of it, and the remaining migration phases,
>> can be short-circuited in the case of dry run.
>>
>> One interesting wrinkle I've observed is the check for cpu compatibility.
>> AFAICT qemu is actually invoked on the dst, "filtered-features" of the cpu
>> are requested via qmp, and results are checked against cpu in domain config.
>> If cpu on dst is insufficient, migration fails in the prepare phase with
>> something like "guest CPU doesn't match specification: missing features: z y z".
>> I was hoping to avoid launching qemu in the case of dry run, but that may
>> be unavoidable if we'd like a dependable dry run result.
>>
>> Thanks for considering the idea!
>>
>> (BTW, if it is considered useful I will follow up with a V1 series that
>> includes this patch and an impl for the qemu driver.)
>>
>>   include/libvirt/libvirt-domain.h | 12 ++++++++++++
>>   src/qemu/qemu_migration.h        |  3 ++-
>>   tools/virsh-domain.c             |  7 +++++++
>>   tools/virsh.pod                  | 10 +++++++++-
>>   4 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>> index fdd2d6b8ea..6d52f6ce50 100644
>> --- a/include/libvirt/libvirt-domain.h
>> +++ b/include/libvirt/libvirt-domain.h
>> @@ -830,6 +830,18 @@ typedef enum {
>>        */
>>       VIR_MIGRATE_TLS               = (1 << 16),
>>   
>> +    /* Setting the VIR_MIGRATE_DRY_RUN flag will cause libvirt to make a
>> +     * best-effort attempt to check if migration will succeed. The destination
>> +     * host will be checked to see if it can accommodate the resources required
>> +     * by the domain. For example: are the network, disk, memory, and CPU
> 
> While this is an honourable goal to achieve I don't think we can
> guarantee it (without running qemu). At least in the qemu world.

I don't think it can be guaranteed even if qemu is run. That's why the rest of 
the comment warns about relying on dry run's success. Dry run succeeding should 
give the user warm fuzzies, but it can't guarantee success of a future migration.

> For instance, libvirt doesn't check if there's enough memory (neither regular
> nor hugepages) when a domain is started/migrated. We just run qemu and let
> it fail. However, for network, CPU and hostdev we do run checks so these
> might work. Disks are in a grey area - we check their presence but not
> their labels. And if a disk is relabel=no then the only way to learn if
> qemu would succeed is to run it.

I'll have to check but I think starting qemu for dry run is a no-go if host 
resources are actually consumed. E.g. if host memory is given to the dry run 
qemu and not available for non dry run instances.

> But I don't see much problem with starting qemu in paused state. I mean,
> we can get through Prepare phase but never actually reach Perform stage.
> The API/flag would return success if Prepare succeeded.

Yep, my thought exactly, along with doing less preparation in the prepare phase.

> I bet it's easier to check if migration would succeed in the xen world, no?

I suppose so, if anything because it supports fewer options. E.g. there's only 
one type of cpu for Xen PV domains.

> The other thing is how are apps expected to use this? I mean, if an app
> wants to work without admin intervention then it would need to learn how
> to fix any possible error (missing disk, perms issue, missing hostdev,
> etc.). This is not a trivial task IMO.

That's the case today if an actual migration fails. Dry run simply allows 
checking the possible success of migration without actually performing it. Admin 
intervention can occur before there is any attempt to perform a doomed migration 
(which in the worst case can result in the domain not running on src or dst).

Regards,
Jim
