[RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental

Alex Williamson posted 1 patch 3 years, 5 months ago
Test checkpatch passed
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/160494787833.1473.10514376876696596117.stgit@gimli.home
[RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Alex Williamson 3 years, 5 months ago
Per the proposed documentation for vfio device migration:

  Dirty pages are tracked when device is in stop-and-copy phase
  because if pages are marked dirty during pre-copy phase and
  content is transferred from source to destination, there is no
  way to know newly dirtied pages from the point they were copied
  earlier until device stops. To avoid repeated copy of same
  content, pinned pages are marked dirty only during
  stop-and-copy phase.

Essentially, since we don't have hardware dirty page tracking for
assigned devices at this point, we consider any page that is pinned
by an mdev vendor driver or pinned and mapped through the IOMMU to
be perpetually dirty.  In the worst case, this may result in all of
guest memory being considered dirty during every iteration of live
migration.  The current vfio implementation of migration has chosen
to mask device dirtied pages until the final stages of migration in
order to avoid this worst case scenario.

Allowing the device to implement a policy decision to prioritize
reduced migration data like this jeopardizes QEMU's overall ability
to implement any degree of service level guarantees during migration.
For example, any estimates towards achieving acceptable downtime
margins cannot be trusted when such a device is present.  The vfio
device should participate in dirty page tracking to the best of its
ability throughout migration, even if that means the dirty footprint
of the device impedes migration progress, allowing both QEMU and
higher level management tools to decide whether to continue the
migration or abort due to failure to achieve the desired behavior.
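
(Illustration, not part of this patch: the container-level tracking
described above is exposed to user space through the type1 IOMMU
dirty-tracking UAPI, i.e. the VFIO_IOMMU_DIRTY_PAGES ioctl.  The
sketch below shows roughly how a user-space driver starts tracking
and fetches the bitmap for one IOVA range; the buffer sizing and the
missing error handling are simplifications, not a copy of QEMU's
implementation.)

  /* Sketch: start dirty tracking and read the bitmap for one IOVA range. */
  #include <linux/vfio.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>

  static int start_dirty_tracking(int container_fd)
  {
      struct vfio_iommu_type1_dirty_bitmap dirty = {
          .argsz = sizeof(dirty),
          .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
      };

      return ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
  }

  /* Caller frees *bitmap; one bit per pgsize page, set means dirty. */
  static int get_dirty_bitmap(int container_fd, uint64_t iova, uint64_t size,
                              uint64_t pgsize, uint64_t **bitmap)
  {
      uint64_t bits = size / pgsize;
      uint64_t bitmap_bytes = ((bits + 63) / 64) * sizeof(uint64_t);
      size_t argsz = sizeof(struct vfio_iommu_type1_dirty_bitmap) +
                     sizeof(struct vfio_iommu_type1_dirty_bitmap_get);
      struct vfio_iommu_type1_dirty_bitmap *dirty = calloc(1, argsz);
      struct vfio_iommu_type1_dirty_bitmap_get *range = (void *)dirty->data;
      int ret;

      *bitmap = calloc(1, bitmap_bytes);
      dirty->argsz = argsz;
      dirty->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
      range->iova = iova;
      range->size = size;
      range->bitmap.pgsize = pgsize;
      range->bitmap.size = bitmap_bytes;
      range->bitmap.data = (__u64 *)*bitmap;

      ret = ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, dirty);
      free(dirty);
      return ret;
  }

Because pinned pages stay pinned, every GET_BITMAP call over a pinned
and mapped range reports those pages dirty again, which is exactly the
"perpetually dirty" behavior described above.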

Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Given that our discussion in the link above seems to be going in
circles, I'm afraid it seems necessary both to have a contingency
plan and to raise the visibility of the current behavior, to
determine whether others agree that it is sufficiently troubling
to consider migration support experimental at this stage.  Please
voice your opinion or contribute patches to resolve this before
QEMU 5.2.  Thanks,

Alex

 hw/vfio/migration.c           |    2 +-
 hw/vfio/pci.c                 |    2 ++
 include/hw/vfio/vfio-common.h |    1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea395d..cd44d465a50b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
     Error *local_err = NULL;
     int ret = -ENOTSUP;
 
-    if (!container->dirty_pages_supported) {
+    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
         goto add_blocker;
     }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971e3..1349b900e513 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_REQ_BIT, true),
     DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
+    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
+                     vbasedev.enable_migration, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
     DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
                      vbasedev.ram_block_discard_allowed, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff102..2119872c8af1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -123,6 +123,7 @@ typedef struct VFIODevice {
     bool needs_reset;
     bool no_mmap;
     bool ram_block_discard_allowed;
+    bool enable_migration;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
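
(For context, not part of the hunks above: the add_blocker path that
the new check falls through to registers a migration blocker for the
device.  The snippet below is a sketch of what the tail of
vfio_migration_probe() roughly looks like around QEMU 5.2; the exact
error message, field names and cleanup in the real function may
differ.)

  add_blocker:
      error_setg(&vbasedev->migration_blocker,
                 "VFIO device doesn't support migration");
      ret = migrate_add_blocker(vbasedev->migration_blocker, &local_err);
      if (local_err) {
          error_propagate(errp, local_err);
          error_free(vbasedev->migration_blocker);
          vbasedev->migration_blocker = NULL;
      }
      return ret;

With this patch applied, the blocker is installed unless the user
explicitly opts in, e.g. with -device vfio-pci,...,x-enable-migration=on,
which is what makes the support experimental rather than removed.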


Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Dr. David Alan Gilbert 3 years, 5 months ago
* Alex Williamson (alex.williamson@redhat.com) wrote:
> Per the proposed documentation for vfio device migration:
> 
>   Dirty pages are tracked when device is in stop-and-copy phase
>   because if pages are marked dirty during pre-copy phase and
>   content is transfered from source to destination, there is no
>   way to know newly dirtied pages from the point they were copied
>   earlier until device stops. To avoid repeated copy of same
>   content, pinned pages are marked dirty only during
>   stop-and-copy phase.
> 
> Essentially, since we don't have hardware dirty page tracking for
> assigned devices at this point, we consider any page that is pinned
> by an mdev vendor driver or pinned and mapped through the IOMMU to
> be perpetually dirty.  In the worst case, this may result in all of
> guest memory being considered dirty during every iteration of live
> migration.  The current vfio implementation of migration has chosen
> to mask device dirtied pages until the final stages of migration in
> order to avoid this worst case scenario.
> 
> Allowing the device to implement a policy decision to prioritize
> reduced migration data like this jeopardizes QEMU's overall ability
> to implement any degree of service level guarantees during migration.
> For example, any estimates towards achieving acceptable downtime
> margins cannot be trusted when such a device is present.  The vfio
> device should participate in dirty page tracking to the best of its
> ability throughout migration, even if that means the dirty footprint
> of the device impedes migration progress, allowing both QEMU and
> higher level management tools to decide whether to continue the
> migration or abort due to failure to achieve the desired behavior.

I don't feel particularly bad about the decision to squash it in
during the stop-and-copy phase; for devices where the pinned memory
is large, I don't think doing it during the main phase makes much
sense, especially if you then have to deal with tracking changes in
pinning.

Having said that, I agree with marking it as experimental, because
I'm dubious about how useful it will be for the same reason; I worry
that the downtime will be so large as to make it pointless.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> Cc: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Neo Jia <cjia@nvidia.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Given that our discussion in the link above seems to be going in
> circles, I'm afraid it seems necessary to both have a contigency
> plan and to raise the visibility of the current behavior to
> determine whether others agree that this is a sufficiently
> troubling behavior to consider migration support experimental
> at this stage.  Please voice your opinion or contribute patches
> to resolve this before QEMU 5.2.  Thanks,
> 
> Alex
> 
>  hw/vfio/migration.c           |    2 +-
>  hw/vfio/pci.c                 |    2 ++
>  include/hw/vfio/vfio-common.h |    1 +
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 3ce285ea395d..cd44d465a50b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>      Error *local_err = NULL;
>      int ret = -ENOTSUP;
>  
> -    if (!container->dirty_pages_supported) {
> +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>          goto add_blocker;
>      }
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 58c0ce8971e3..1349b900e513 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
>                      VFIO_FEATURE_ENABLE_REQ_BIT, true),
>      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
>                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
> +                     vbasedev.enable_migration, false),
>      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
>                       vbasedev.ram_block_discard_allowed, false),
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index baeb4dcff102..2119872c8af1 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>      bool needs_reset;
>      bool no_mmap;
>      bool ram_block_discard_allowed;
> +    bool enable_migration;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Alex Williamson 3 years, 5 months ago
On Mon, 9 Nov 2020 19:44:17 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > Per the proposed documentation for vfio device migration:
> > 
> >   Dirty pages are tracked when device is in stop-and-copy phase
> >   because if pages are marked dirty during pre-copy phase and
> >   content is transfered from source to destination, there is no
> >   way to know newly dirtied pages from the point they were copied
> >   earlier until device stops. To avoid repeated copy of same
> >   content, pinned pages are marked dirty only during
> >   stop-and-copy phase.
> > 
> > Essentially, since we don't have hardware dirty page tracking for
> > assigned devices at this point, we consider any page that is pinned
> > by an mdev vendor driver or pinned and mapped through the IOMMU to
> > be perpetually dirty.  In the worst case, this may result in all of
> > guest memory being considered dirty during every iteration of live
> > migration.  The current vfio implementation of migration has chosen
> > to mask device dirtied pages until the final stages of migration in
> > order to avoid this worst case scenario.
> > 
> > Allowing the device to implement a policy decision to prioritize
> > reduced migration data like this jeopardizes QEMU's overall ability
> > to implement any degree of service level guarantees during migration.
> > For example, any estimates towards achieving acceptable downtime
> > margins cannot be trusted when such a device is present.  The vfio
> > device should participate in dirty page tracking to the best of its
> > ability throughout migration, even if that means the dirty footprint
> > of the device impedes migration progress, allowing both QEMU and
> > higher level management tools to decide whether to continue the
> > migration or abort due to failure to achieve the desired behavior.  
> 
> I don't feel particularly badly about the decision to squash it in
> during the stop-and-copy phase; for devices where the pinned memory
> is large, I don't think doing it during the main phase makes much sense;
> especially if you then have to deal with tracking changes in pinning.


AFAIK the kernel support for tracking changes in page pinning already
exists; it is largely the vfio device in QEMU that decides when to
start exposing the device's dirty footprint to QEMU.  I'm a bit
surprised by this answer though; we don't really know what the device
memory footprint is.  It might be large, it might be nothing, but by
not participating in dirty page tracking until the VM is stopped, we
can't know what the footprint is or how it will affect downtime.  Is
it really the place of a QEMU device driver to impose this sort of
policy?


> Having said that, I agree with marking it as experimental, because
> I'm dubious how useful it will be for the same reason, I worry
> about whether the downtime will be so large to make it pointless.


TBH I think that's the wrong reason to mark it experimental.  There's
clearly demand for vfio device migration, and even if the practical use
cases are initially small, they will expand over time and hardware will
get better.  My objection is that the current behavior masks the
hardware and device limitations, leading to unrealistic expectations.
If the user expects minimal downtime, configures convergence to account
for that, QEMU thinks it can achieve it, and then the device marks
everything dirty, that's not supportable.  OTOH if the vfio device
participates in dirty tracking through pre-copy, then the practical use
cases will sort themselves out: migrations will either be aborted
because downtime tolerances cannot be achieved, or downtimes will be
configured to match reality.  Thanks,

Alex

> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> > Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> > Cc: Kirti Wankhede <kwankhede@nvidia.com>
> > Cc: Neo Jia <cjia@nvidia.com>
> > Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Cc: Juan Quintela <quintela@redhat.com>
> > Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
> > Cc: Cornelia Huck <cohuck@redhat.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> > Given that our discussion in the link above seems to be going in
> > circles, I'm afraid it seems necessary to both have a contigency
> > plan and to raise the visibility of the current behavior to
> > determine whether others agree that this is a sufficiently
> > troubling behavior to consider migration support experimental
> > at this stage.  Please voice your opinion or contribute patches
> > to resolve this before QEMU 5.2.  Thanks,
> > 
> > Alex
> > 
> >  hw/vfio/migration.c           |    2 +-
> >  hw/vfio/pci.c                 |    2 ++
> >  include/hw/vfio/vfio-common.h |    1 +
> >  3 files changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 3ce285ea395d..cd44d465a50b 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> >      Error *local_err = NULL;
> >      int ret = -ENOTSUP;
> >  
> > -    if (!container->dirty_pages_supported) {
> > +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> >          goto add_blocker;
> >      }
> >  
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 58c0ce8971e3..1349b900e513 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
> >                      VFIO_FEATURE_ENABLE_REQ_BIT, true),
> >      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
> >                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> > +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
> > +                     vbasedev.enable_migration, false),
> >      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> >      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
> >                       vbasedev.ram_block_discard_allowed, false),
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index baeb4dcff102..2119872c8af1 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> >      bool needs_reset;
> >      bool no_mmap;
> >      bool ram_block_discard_allowed;
> > +    bool enable_migration;
> >      VFIODeviceOps *ops;
> >      unsigned int num_irqs;
> >      unsigned int num_regions;
> >   


Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Dr. David Alan Gilbert 3 years, 5 months ago
* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 9 Nov 2020 19:44:17 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > Per the proposed documentation for vfio device migration:
> > > 
> > >   Dirty pages are tracked when device is in stop-and-copy phase
> > >   because if pages are marked dirty during pre-copy phase and
> > >   content is transfered from source to destination, there is no
> > >   way to know newly dirtied pages from the point they were copied
> > >   earlier until device stops. To avoid repeated copy of same
> > >   content, pinned pages are marked dirty only during
> > >   stop-and-copy phase.
> > > 
> > > Essentially, since we don't have hardware dirty page tracking for
> > > assigned devices at this point, we consider any page that is pinned
> > > by an mdev vendor driver or pinned and mapped through the IOMMU to
> > > be perpetually dirty.  In the worst case, this may result in all of
> > > guest memory being considered dirty during every iteration of live
> > > migration.  The current vfio implementation of migration has chosen
> > > to mask device dirtied pages until the final stages of migration in
> > > order to avoid this worst case scenario.
> > > 
> > > Allowing the device to implement a policy decision to prioritize
> > > reduced migration data like this jeopardizes QEMU's overall ability
> > > to implement any degree of service level guarantees during migration.
> > > For example, any estimates towards achieving acceptable downtime
> > > margins cannot be trusted when such a device is present.  The vfio
> > > device should participate in dirty page tracking to the best of its
> > > ability throughout migration, even if that means the dirty footprint
> > > of the device impedes migration progress, allowing both QEMU and
> > > higher level management tools to decide whether to continue the
> > > migration or abort due to failure to achieve the desired behavior.  
> > 
> > I don't feel particularly badly about the decision to squash it in
> > during the stop-and-copy phase; for devices where the pinned memory
> > is large, I don't think doing it during the main phase makes much sense;
> > especially if you then have to deal with tracking changes in pinning.
> 
> 
> AFAIK the kernel support for tracking changes in page pinning already
> exists, this is largely the vfio device in QEMU that decides when to
> start exposing the device dirty footprint to QEMU.  I'm a bit surprised
> by this answer though, we don't really know what the device memory
> footprint is.  It might be large, it might be nothing, but by not
> participating in dirty page tracking until the VM is stopped, we can't
> know what the footprint is and how it will affect downtime.  Is it
> really the place of a QEMU device driver to impose this sort of policy?

If it could actually track changes then I'd agree we shouldn't impose
any policy; but if it's just marking the whole area as dirty we're going
to need a bodge somewhere; this bodge doesn't look any worse than the
others to me.

> 
> > Having said that, I agree with marking it as experimental, because
> > I'm dubious how useful it will be for the same reason, I worry
> > about whether the downtime will be so large to make it pointless.
> 
> 
> TBH I think that's the wrong reason to mark it experimental.  There's
> clearly demand for vfio device migration and even if the practical use
> cases are initially small, they will expand over time and hardware will
> get better.  My objection is that the current behavior masks the
> hardware and device limitations, leading to unrealistic expectations.
> If the user expects minimal downtime, configures convergence to account
> for that, QEMU thinks it can achieve it, and then the device marks
> everything dirty, that's not supportable. 

Yes, agreed.

> OTOH if the vfio device
> participates in dirty tracking through pre-copy, then the practical use
> cases will find themselves as migrations will either be aborted because
> downtime tolerances cannot be achieved or downtimes will be configured
> to match reality.  Thanks,

Without a way to prioritise the unpinned memory during that period,
we're going to be repeatedly sending the pinned memory, which is going
to lead to much larger bandwidth usage than required; so that's going
in completely the wrong direction, and it's also wrong from the point
of view of the user.

Dave

> 
> Alex
> 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > > Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> > > Cc: Kirti Wankhede <kwankhede@nvidia.com>
> > > Cc: Neo Jia <cjia@nvidia.com>
> > > Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > Cc: Juan Quintela <quintela@redhat.com>
> > > Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
> > > Cc: Cornelia Huck <cohuck@redhat.com>
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > > 
> > > Given that our discussion in the link above seems to be going in
> > > circles, I'm afraid it seems necessary to both have a contigency
> > > plan and to raise the visibility of the current behavior to
> > > determine whether others agree that this is a sufficiently
> > > troubling behavior to consider migration support experimental
> > > at this stage.  Please voice your opinion or contribute patches
> > > to resolve this before QEMU 5.2.  Thanks,
> > > 
> > > Alex
> > > 
> > >  hw/vfio/migration.c           |    2 +-
> > >  hw/vfio/pci.c                 |    2 ++
> > >  include/hw/vfio/vfio-common.h |    1 +
> > >  3 files changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index 3ce285ea395d..cd44d465a50b 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> > >      Error *local_err = NULL;
> > >      int ret = -ENOTSUP;
> > >  
> > > -    if (!container->dirty_pages_supported) {
> > > +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > >          goto add_blocker;
> > >      }
> > >  
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 58c0ce8971e3..1349b900e513 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
> > >                      VFIO_FEATURE_ENABLE_REQ_BIT, true),
> > >      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
> > >                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> > > +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
> > > +                     vbasedev.enable_migration, false),
> > >      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> > >      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
> > >                       vbasedev.ram_block_discard_allowed, false),
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index baeb4dcff102..2119872c8af1 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> > >      bool needs_reset;
> > >      bool no_mmap;
> > >      bool ram_block_discard_allowed;
> > > +    bool enable_migration;
> > >      VFIODeviceOps *ops;
> > >      unsigned int num_irqs;
> > >      unsigned int num_regions;
> > >   
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Kirti Wankhede 3 years, 5 months ago

On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> * Alex Williamson (alex.williamson@redhat.com) wrote:
>> On Mon, 9 Nov 2020 19:44:17 +0000
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>
>>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>>> Per the proposed documentation for vfio device migration:
>>>>
>>>>    Dirty pages are tracked when device is in stop-and-copy phase
>>>>    because if pages are marked dirty during pre-copy phase and
>>>>    content is transfered from source to destination, there is no
>>>>    way to know newly dirtied pages from the point they were copied
>>>>    earlier until device stops. To avoid repeated copy of same
>>>>    content, pinned pages are marked dirty only during
>>>>    stop-and-copy phase.
>>>>
>>>> Essentially, since we don't have hardware dirty page tracking for
>>>> assigned devices at this point, we consider any page that is pinned
>>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
>>>> be perpetually dirty.  In the worst case, this may result in all of
>>>> guest memory being considered dirty during every iteration of live
>>>> migration.  The current vfio implementation of migration has chosen
>>>> to mask device dirtied pages until the final stages of migration in
>>>> order to avoid this worst case scenario.
>>>>
>>>> Allowing the device to implement a policy decision to prioritize
>>>> reduced migration data like this jeopardizes QEMU's overall ability
>>>> to implement any degree of service level guarantees during migration.
>>>> For example, any estimates towards achieving acceptable downtime
>>>> margins cannot be trusted when such a device is present.  The vfio
>>>> device should participate in dirty page tracking to the best of its
>>>> ability throughout migration, even if that means the dirty footprint
>>>> of the device impedes migration progress, allowing both QEMU and
>>>> higher level management tools to decide whether to continue the
>>>> migration or abort due to failure to achieve the desired behavior.
>>>
>>> I don't feel particularly badly about the decision to squash it in
>>> during the stop-and-copy phase; for devices where the pinned memory
>>> is large, I don't think doing it during the main phase makes much sense;
>>> especially if you then have to deal with tracking changes in pinning.
>>
>>
>> AFAIK the kernel support for tracking changes in page pinning already
>> exists, this is largely the vfio device in QEMU that decides when to
>> start exposing the device dirty footprint to QEMU.  I'm a bit surprised
>> by this answer though, we don't really know what the device memory
>> footprint is.  It might be large, it might be nothing, but by not
>> participating in dirty page tracking until the VM is stopped, we can't
>> know what the footprint is and how it will affect downtime.  Is it
>> really the place of a QEMU device driver to impose this sort of policy?
> 
> If it could actually track changes then I'd agree we shouldn't impose
> any policy; but if it's just marking the whole area as dirty we're going
> to need a bodge somewhere; this bodge doesn't look any worse than the
> others to me.
> 
>>
>>> Having said that, I agree with marking it as experimental, because
>>> I'm dubious how useful it will be for the same reason, I worry
>>> about whether the downtime will be so large to make it pointless.
>>

Not all device state is large; for example, a NIC might only report
currently mapped RX buffers, which are usually not more than 1 GB and
could be as low as tens of MB.  A GPU might or might not have large
data, depending on its use case.

>>
>> TBH I think that's the wrong reason to mark it experimental.  There's
>> clearly demand for vfio device migration and even if the practical use
>> cases are initially small, they will expand over time and hardware will
>> get better.  My objection is that the current behavior masks the
>> hardware and device limitations, leading to unrealistic expectations.
>> If the user expects minimal downtime, configures convergence to account
>> for that, QEMU thinks it can achieve it, and then the device marks
>> everything dirty, that's not supportable.
> 
> Yes, agreed.

Yes, there is demand for vfio device migration, and many device owners
have started scoping and developing migration support.
Instead of marking the whole of migration support as experimental, we
could have an opt-in option to decide whether to mark system memory
pages dirty during the iterative (pre-copy) phase of migration.

Thanks,
Kirti

> 
>> OTOH if the vfio device
>> participates in dirty tracking through pre-copy, then the practical use
>> cases will find themselves as migrations will either be aborted because
>> downtime tolerances cannot be achieved or downtimes will be configured
>> to match reality.  Thanks,
> 
> Without a way to prioritise the unpinned memory during that period,
> we're going to be repeatedly sending the pinned memory which is going to
> lead to a much larger bandwidth usage that required; so that's going in
> completely the wrong direction and also wrong from the point of view of
> the user.
> 
> Dave
> 
>>
>> Alex
>>
>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>>> Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
>>>> Cc: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Cc: Neo Jia <cjia@nvidia.com>
>>>> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>> Cc: Juan Quintela <quintela@redhat.com>
>>>> Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
>>>> Cc: Cornelia Huck <cohuck@redhat.com>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>> ---
>>>>
>>>> Given that our discussion in the link above seems to be going in
>>>> circles, I'm afraid it seems necessary to both have a contigency
>>>> plan and to raise the visibility of the current behavior to
>>>> determine whether others agree that this is a sufficiently
>>>> troubling behavior to consider migration support experimental
>>>> at this stage.  Please voice your opinion or contribute patches
>>>> to resolve this before QEMU 5.2.  Thanks,
>>>>
>>>> Alex
>>>>
>>>>   hw/vfio/migration.c           |    2 +-
>>>>   hw/vfio/pci.c                 |    2 ++
>>>>   include/hw/vfio/vfio-common.h |    1 +
>>>>   3 files changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 3ce285ea395d..cd44d465a50b 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
>>>>       Error *local_err = NULL;
>>>>       int ret = -ENOTSUP;
>>>>   
>>>> -    if (!container->dirty_pages_supported) {
>>>> +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
>>>>           goto add_blocker;
>>>>       }
>>>>   
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 58c0ce8971e3..1349b900e513 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
>>>>                       VFIO_FEATURE_ENABLE_REQ_BIT, true),
>>>>       DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
>>>>                       VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
>>>> +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
>>>> +                     vbasedev.enable_migration, false),
>>>>       DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
>>>>       DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
>>>>                        vbasedev.ram_block_discard_allowed, false),
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index baeb4dcff102..2119872c8af1 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -123,6 +123,7 @@ typedef struct VFIODevice {
>>>>       bool needs_reset;
>>>>       bool no_mmap;
>>>>       bool ram_block_discard_allowed;
>>>> +    bool enable_migration;
>>>>       VFIODeviceOps *ops;
>>>>       unsigned int num_irqs;
>>>>       unsigned int num_regions;
>>>>    
>>

Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Alex Williamson 3 years, 5 months ago
On Tue, 10 Nov 2020 19:46:20 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> >> On Mon, 9 Nov 2020 19:44:17 +0000
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >>  
> >>> * Alex Williamson (alex.williamson@redhat.com) wrote:  
> >>>> Per the proposed documentation for vfio device migration:
> >>>>
> >>>>    Dirty pages are tracked when device is in stop-and-copy phase
> >>>>    because if pages are marked dirty during pre-copy phase and
> >>>>    content is transfered from source to destination, there is no
> >>>>    way to know newly dirtied pages from the point they were copied
> >>>>    earlier until device stops. To avoid repeated copy of same
> >>>>    content, pinned pages are marked dirty only during
> >>>>    stop-and-copy phase.
> >>>>
> >>>> Essentially, since we don't have hardware dirty page tracking for
> >>>> assigned devices at this point, we consider any page that is pinned
> >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> >>>> be perpetually dirty.  In the worst case, this may result in all of
> >>>> guest memory being considered dirty during every iteration of live
> >>>> migration.  The current vfio implementation of migration has chosen
> >>>> to mask device dirtied pages until the final stages of migration in
> >>>> order to avoid this worst case scenario.
> >>>>
> >>>> Allowing the device to implement a policy decision to prioritize
> >>>> reduced migration data like this jeopardizes QEMU's overall ability
> >>>> to implement any degree of service level guarantees during migration.
> >>>> For example, any estimates towards achieving acceptable downtime
> >>>> margins cannot be trusted when such a device is present.  The vfio
> >>>> device should participate in dirty page tracking to the best of its
> >>>> ability throughout migration, even if that means the dirty footprint
> >>>> of the device impedes migration progress, allowing both QEMU and
> >>>> higher level management tools to decide whether to continue the
> >>>> migration or abort due to failure to achieve the desired behavior.  
> >>>
> >>> I don't feel particularly badly about the decision to squash it in
> >>> during the stop-and-copy phase; for devices where the pinned memory
> >>> is large, I don't think doing it during the main phase makes much sense;
> >>> especially if you then have to deal with tracking changes in pinning.  
> >>
> >>
> >> AFAIK the kernel support for tracking changes in page pinning already
> >> exists, this is largely the vfio device in QEMU that decides when to
> >> start exposing the device dirty footprint to QEMU.  I'm a bit surprised
> >> by this answer though, we don't really know what the device memory
> >> footprint is.  It might be large, it might be nothing, but by not
> >> participating in dirty page tracking until the VM is stopped, we can't
> >> know what the footprint is and how it will affect downtime.  Is it
> >> really the place of a QEMU device driver to impose this sort of policy?  
> > 
> > If it could actually track changes then I'd agree we shouldn't impose
> > any policy; but if it's just marking the whole area as dirty we're going
> > to need a bodge somewhere; this bodge doesn't look any worse than the
> > others to me.
> >   
> >>  
> >>> Having said that, I agree with marking it as experimental, because
> >>> I'm dubious how useful it will be for the same reason, I worry
> >>> about whether the downtime will be so large to make it pointless.  
> >>  
> 
> Not all device state is large, for example NIC might only report 
> currently mapped RX buffers which usually not more than a 1GB and could 
> be as low as 10's of MB. GPU might or might not have large data, that 
> depends on its use cases.


Right, it's only if we have a vendor driver that doesn't pin any
memory when dirty tracking is enabled, and we're running without a
viommu, that we would expect all of guest memory to be continuously
dirty.


> >> TBH I think that's the wrong reason to mark it experimental.  There's
> >> clearly demand for vfio device migration and even if the practical use
> >> cases are initially small, they will expand over time and hardware will
> >> get better.  My objection is that the current behavior masks the
> >> hardware and device limitations, leading to unrealistic expectations.
> >> If the user expects minimal downtime, configures convergence to account
> >> for that, QEMU thinks it can achieve it, and then the device marks
> >> everything dirty, that's not supportable.  
> > 
> > Yes, agreed.  
> 
> Yes, there is demand for vfio device migration and many devices owners 
> started scoping and development for migration support.
> Instead of making whole migration support as experimental, we can have 
> opt-in option to decide to mark sys mem pages dirty during iterative 
> phase (pre-copy phase) of migration.


Per my previous suggestion, I'd think an opt-out would be more
appropriate, i.e. implementing pre-copy dirty page tracking by default.
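
(Sketch only: if that opt-out direction were taken, one way to express
it would be a default-true property controlling pre-copy dirty page
tracking, added to vfio_pci_dev_properties[] in the same style as the
property in this patch.  The property and field names below are
invented for illustration, not an existing interface.)

      DEFINE_PROP_BOOL("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
                       vbasedev.pre_copy_dirty_page_tracking, true),

That keeps pre-copy dirty page tracking enabled by default while still
giving users a knob to disable it where it is known to hurt.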


> >> OTOH if the vfio device
> >> participates in dirty tracking through pre-copy, then the practical use
> >> cases will find themselves as migrations will either be aborted because
> >> downtime tolerances cannot be achieved or downtimes will be configured
> >> to match reality.  Thanks,  
> > 
> > Without a way to prioritise the unpinned memory during that period,
> > we're going to be repeatedly sending the pinned memory which is going to
> > lead to a much larger bandwidth usage that required; so that's going in
> > completely the wrong direction and also wrong from the point of view of
> > the user.


Who decides which is the wrong direction for the user?  If the user
wants minimal bandwidth regardless of downtime, can't they create a
procedure to pause the VM, do the migration, then resume?  Are there
already migration tunables to effectively achieve this?  If a user
attempts to do a "live" migration, isn't our priority then shifted to
managing the downtime constraints over the bandwidth?  IOW, the policy
decision is implied by the user's actions and the configuration of the
migration; I don't think we should be guessing at the device level
which feature to prioritize, just as a vCPU doesn't get to stop marking
dirty pages during pre-copy because it's touching too much memory.
Higher level policies and configurations should determine inflection
points... imo.  Thanks,

Alex


Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Neo Jia 3 years, 5 months ago
On Tue, Nov 10, 2020 at 08:20:50AM -0700, Alex Williamson wrote:
> 
> On Tue, 10 Nov 2020 19:46:20 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > >> On Mon, 9 Nov 2020 19:44:17 +0000
> > >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >>
> > >>> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > >>>> Per the proposed documentation for vfio device migration:
> > >>>>
> > >>>>    Dirty pages are tracked when device is in stop-and-copy phase
> > >>>>    because if pages are marked dirty during pre-copy phase and
> > >>>>    content is transfered from source to destination, there is no
> > >>>>    way to know newly dirtied pages from the point they were copied
> > >>>>    earlier until device stops. To avoid repeated copy of same
> > >>>>    content, pinned pages are marked dirty only during
> > >>>>    stop-and-copy phase.
> > >>>>
> > >>>> Essentially, since we don't have hardware dirty page tracking for
> > >>>> assigned devices at this point, we consider any page that is pinned
> > >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> > >>>> be perpetually dirty.  In the worst case, this may result in all of
> > >>>> guest memory being considered dirty during every iteration of live
> > >>>> migration.  The current vfio implementation of migration has chosen
> > >>>> to mask device dirtied pages until the final stages of migration in
> > >>>> order to avoid this worst case scenario.
> > >>>>
> > >>>> Allowing the device to implement a policy decision to prioritize
> > >>>> reduced migration data like this jeopardizes QEMU's overall ability
> > >>>> to implement any degree of service level guarantees during migration.
> > >>>> For example, any estimates towards achieving acceptable downtime
> > >>>> margins cannot be trusted when such a device is present.  The vfio
> > >>>> device should participate in dirty page tracking to the best of its
> > >>>> ability throughout migration, even if that means the dirty footprint
> > >>>> of the device impedes migration progress, allowing both QEMU and
> > >>>> higher level management tools to decide whether to continue the
> > >>>> migration or abort due to failure to achieve the desired behavior.
> > >>>
> > >>> I don't feel particularly badly about the decision to squash it in
> > >>> during the stop-and-copy phase; for devices where the pinned memory
> > >>> is large, I don't think doing it during the main phase makes much sense;
> > >>> especially if you then have to deal with tracking changes in pinning.
> > >>
> > >>
> > >> AFAIK the kernel support for tracking changes in page pinning already
> > >> exists, this is largely the vfio device in QEMU that decides when to
> > >> start exposing the device dirty footprint to QEMU.  I'm a bit surprised
> > >> by this answer though, we don't really know what the device memory
> > >> footprint is.  It might be large, it might be nothing, but by not
> > >> participating in dirty page tracking until the VM is stopped, we can't
> > >> know what the footprint is and how it will affect downtime.  Is it
> > >> really the place of a QEMU device driver to impose this sort of policy?
> > >
> > > If it could actually track changes then I'd agree we shouldn't impose
> > > any policy; but if it's just marking the whole area as dirty we're going
> > > to need a bodge somewhere; this bodge doesn't look any worse than the
> > > others to me.
> > >
> > >>
> > >>> Having said that, I agree with marking it as experimental, because
> > >>> I'm dubious how useful it will be for the same reason, I worry
> > >>> about whether the downtime will be so large to make it pointless.
> > >>
> >
> > Not all device state is large, for example NIC might only report
> > currently mapped RX buffers which usually not more than a 1GB and could
> > be as low as 10's of MB. GPU might or might not have large data, that
> > depends on its use cases.
> 
> 
> Right, it's only if we have a vendor driver that doesn't pin any memory
> when dirty tracking is enabled and we're running without a viommu that
> we would expect all of guest memory to be continuously dirty.
> 
> 
> > >> TBH I think that's the wrong reason to mark it experimental.  There's
> > >> clearly demand for vfio device migration and even if the practical use
> > >> cases are initially small, they will expand over time and hardware will
> > >> get better.  My objection is that the current behavior masks the
> > >> hardware and device limitations, leading to unrealistic expectations.
> > >> If the user expects minimal downtime, configures convergence to account
> > >> for that, QEMU thinks it can achieve it, and then the device marks
> > >> everything dirty, that's not supportable.
> > >
> > > Yes, agreed.
> >
> > Yes, there is demand for vfio device migration and many devices owners
> > started scoping and development for migration support.
> > Instead of making whole migration support as experimental, we can have
> > opt-in option to decide to mark sys mem pages dirty during iterative
> > phase (pre-copy phase) of migration.
> 
> 
> Per my previous suggestion, I'd think an opt-out would be more
> appropriate, ie. implementing pre-copy dirty page tracking by default.

I think this would be a better approach, without needing to mark this
feature as experimental.

Thanks,
Neo

> 
> 
> > >> OTOH if the vfio device
> > >> participates in dirty tracking through pre-copy, then the practical use
> > >> cases will find themselves as migrations will either be aborted because
> > >> downtime tolerances cannot be achieved or downtimes will be configured
> > >> to match reality.  Thanks,
> > >
> > > Without a way to prioritise the unpinned memory during that period,
> > > we're going to be repeatedly sending the pinned memory which is going to
> > > lead to a much larger bandwidth usage that required; so that's going in
> > > completely the wrong direction and also wrong from the point of view of
> > > the user.
> 
> 
> Who decides which is the wrong direction for the user?  If the user
> wants minimal bandwidth regardless of downtime, can't they create a
> procedure to pause the VM, do the migration, then resume?  Are there
> already migration tunables to effectively achieve this?  If a user
> attempts to do a "live" migration, isn't our priority then shifted to
> managing the downtime constraints over the bandwidth?  IOW the policy
> decision is implied by the user actions and configuration of the
> migration, I don't think that at the device level we should be guessing
> which feature to prioritize, just like a vCPU doesn't to stop marking
> dirty pages during pre-copy because it's touching too much memory.
> Higher level policies and configurations should determine inflection
> points... imo.  Thanks,
> 
> Alex
> 

Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Posted by Cornelia Huck 3 years, 5 months ago
On Mon, 09 Nov 2020 11:56:02 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> Per the proposed documentation for vfio device migration:
> 
>   Dirty pages are tracked when device is in stop-and-copy phase
>   because if pages are marked dirty during pre-copy phase and
>   content is transfered from source to destination, there is no
>   way to know newly dirtied pages from the point they were copied
>   earlier until device stops. To avoid repeated copy of same
>   content, pinned pages are marked dirty only during
>   stop-and-copy phase.
> 
> Essentially, since we don't have hardware dirty page tracking for
> assigned devices at this point, we consider any page that is pinned
> by an mdev vendor driver or pinned and mapped through the IOMMU to
> be perpetually dirty.  In the worst case, this may result in all of
> guest memory being considered dirty during every iteration of live
> migration.  The current vfio implementation of migration has chosen
> to mask device dirtied pages until the final stages of migration in
> order to avoid this worst case scenario.
> 
> Allowing the device to implement a policy decision to prioritize
> reduced migration data like this jeopardizes QEMU's overall ability
> to implement any degree of service level guarantees during migration.
> For example, any estimates towards achieving acceptable downtime
> margins cannot be trusted when such a device is present.  The vfio
> device should participate in dirty page tracking to the best of its
> ability throughout migration, even if that means the dirty footprint
> of the device impedes migration progress, allowing both QEMU and
> higher level management tools to decide whether to continue the
> migration or abort due to failure to achieve the desired behavior.
> 
> Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> Cc: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Neo Jia <cjia@nvidia.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Given that our discussion in the link above seems to be going in
> circles, I'm afraid it seems necessary to both have a contigency
> plan and to raise the visibility of the current behavior to
> determine whether others agree that this is a sufficiently
> troubling behavior to consider migration support experimental
> at this stage.  Please voice your opinion or contribute patches
> to resolve this before QEMU 5.2.  Thanks,
> 
> Alex
> 
>  hw/vfio/migration.c           |    2 +-
>  hw/vfio/pci.c                 |    2 ++
>  include/hw/vfio/vfio-common.h |    1 +
>  3 files changed, 4 insertions(+), 1 deletion(-)

Given the ongoing discussions, I'd be rather more comfortable making
this experimental for the upcoming release and spending some time
getting it into a state that everyone is happy to live with, so
Acked-by: Cornelia Huck <cohuck@redhat.com>