vfio: Move the saving of the config space to the right place in VFIO migration

[PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

When running VFIO migration, I found that the restoring of VFIO PCI device’s
config space is before VGIC on ARM64 target. But generally, interrupt controllers
need to be restored before PCI devices. Besides, if a VFIO PCI device is
configured to have directly-injected MSIs (VLPIs), the restoring of its config
space will trigger the configuring of these VLPIs (in kernel), where it would
return an error as I saw due to the dependency on kvm’s vgic.

To avoid this, we can move the saving of the config space from the iterable
process to the non-iterable process, so that it will be called after VGIC
according to their priorities.

Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
 hw/vfio/migration.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea39..028da35a25 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
     return 0;
 }
 
-static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+static void vfio_save_device_config_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
 
@@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
 
     trace_vfio_save_device_config_state(vbasedev->name);
 
-    return qemu_file_get_error(f);
+    if (qemu_file_get_error(f))
+        error_report("%s: Failed to save device config space",
+                     vbasedev->name);
 }
 
 static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
-    uint64_t data;
 
     if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
         int ret;
@@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
         }
     }
 
-    data = qemu_get_be64(f);
-    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
-        error_report("%s: Failed loading device config space, "
-                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
-        return -EINVAL;
-    }
-
     trace_vfio_load_device_config_state(vbasedev->name);
-    return qemu_file_get_error(f);
+    return 0;
 }
 
 static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
@@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    ret = vfio_save_device_config_state(f, opaque);
-    if (ret) {
-        return ret;
-    }
-
     ret = vfio_update_pending(vbasedev);
     if (ret) {
         return ret;
@@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_state = vfio_save_device_config_state,
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
-- 
2.19.1

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Kirti Wankhede 5 years, 2 months ago


On 11/14/2020 2:47 PM, Shenming Lu wrote:
> When running VFIO migration, I found that the restoring of VFIO PCI device’s
> config space is before VGIC on ARM64 target. But generally, interrupt controllers
> need to be restored before PCI devices. 

Is there any other way by which VGIC can be restored before PCI device?

> Besides, if a VFIO PCI device is
> configured to have directly-injected MSIs (VLPIs), the restoring of its config
> space will trigger the configuring of these VLPIs (in kernel), where it would
> return an error as I saw due to the dependency on kvm’s vgic.
> 

Can this be fixed in kernel to re-initialize the kernel state?

> To avoid this, we can move the saving of the config space from the iterable
> process to the non-iterable process, so that it will be called after VGIC
> according to their priorities.
> 

With this change, at resume side, pre-copy phase data would reach 
destination without restored config space. VFIO device on destination 
might need it's config space setup and validated before it can accept 
further VFIO device specific migration state.

This also changes bit-stream, so it would break migration with original 
migration patch-set.

Thanks,
Kirti

> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>   hw/vfio/migration.c | 22 ++++++----------------
>   1 file changed, 6 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 3ce285ea39..028da35a25 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
>       return 0;
>   }
>   
> -static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> +static void vfio_save_device_config_state(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>   
> @@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>   
>       trace_vfio_save_device_config_state(vbasedev->name);
>   
> -    return qemu_file_get_error(f);
> +    if (qemu_file_get_error(f))
> +        error_report("%s: Failed to save device config space",
> +                     vbasedev->name);
>   }
>   
>   static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> -    uint64_t data;
>   
>       if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
>           int ret;
> @@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>           }
>       }
>   
> -    data = qemu_get_be64(f);
> -    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> -        error_report("%s: Failed loading device config space, "
> -                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> -        return -EINVAL;
> -    }
> -
>       trace_vfio_load_device_config_state(vbasedev->name);
> -    return qemu_file_get_error(f);
> +    return 0;
>   }
>   
>   static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> @@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>           return ret;
>       }
>   
> -    ret = vfio_save_device_config_state(f, opaque);
> -    if (ret) {
> -        return ret;
> -    }
> -
>       ret = vfio_update_pending(vbasedev);
>       if (ret) {
>           return ret;
> @@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
>       .save_live_pending = vfio_save_pending,
>       .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_state = vfio_save_device_config_state,
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
>

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Alex Williamson 5 years, 2 months ago

On Thu, 19 Nov 2020 14:13:24 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/14/2020 2:47 PM, Shenming Lu wrote:
> > When running VFIO migration, I found that the restoring of VFIO PCI device’s
> > config space is before VGIC on ARM64 target. But generally, interrupt controllers
> > need to be restored before PCI devices.   
> 
> Is there any other way by which VGIC can be restored before PCI device?
> 
> > Besides, if a VFIO PCI device is
> > configured to have directly-injected MSIs (VLPIs), the restoring of its config
> > space will trigger the configuring of these VLPIs (in kernel), where it would
> > return an error as I saw due to the dependency on kvm’s vgic.
> >   
> 
> Can this be fixed in kernel to re-initialize the kernel state?
> 
> > To avoid this, we can move the saving of the config space from the iterable
> > process to the non-iterable process, so that it will be called after VGIC
> > according to their priorities.
> >   
> 
> With this change, at resume side, pre-copy phase data would reach 
> destination without restored config space. VFIO device on destination 
> might need it's config space setup and validated before it can accept 
> further VFIO device specific migration state.
> 
> This also changes bit-stream, so it would break migration with original 
> migration patch-set.

Config space can continue to change while in pre-copy, if we're only
sending config space at the initiation of pre-copy, how are any changes
that might occur before the VM is stopped conveyed to the target?  For
example the guest might reboot and a device returned to INTx mode from
MSI during pre-copy.  Thanks,

Alex


> > Signed-off-by: Shenming Lu <lushenming@huawei.com>
> > ---
> >   hw/vfio/migration.c | 22 ++++++----------------
> >   1 file changed, 6 insertions(+), 16 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 3ce285ea39..028da35a25 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
> >       return 0;
> >   }
> >   
> > -static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> > +static void vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >   {
> >       VFIODevice *vbasedev = opaque;
> >   
> > @@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
> >   
> >       trace_vfio_save_device_config_state(vbasedev->name);
> >   
> > -    return qemu_file_get_error(f);
> > +    if (qemu_file_get_error(f))
> > +        error_report("%s: Failed to save device config space",
> > +                     vbasedev->name);
> >   }
> >   
> >   static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >   {
> >       VFIODevice *vbasedev = opaque;
> > -    uint64_t data;
> >   
> >       if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> >           int ret;
> > @@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> >           }
> >       }
> >   
> > -    data = qemu_get_be64(f);
> > -    if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> > -        error_report("%s: Failed loading device config space, "
> > -                     "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> > -        return -EINVAL;
> > -    }
> > -
> >       trace_vfio_load_device_config_state(vbasedev->name);
> > -    return qemu_file_get_error(f);
> > +    return 0;
> >   }
> >   
> >   static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
> > @@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> >           return ret;
> >       }
> >   
> > -    ret = vfio_save_device_config_state(f, opaque);
> > -    if (ret) {
> > -        return ret;
> > -    }
> > -
> >       ret = vfio_update_pending(vbasedev);
> >       if (ret) {
> >           return ret;
> > @@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
> >       .save_live_pending = vfio_save_pending,
> >       .save_live_iterate = vfio_save_iterate,
> >       .save_live_complete_precopy = vfio_save_complete_precopy,
> > +    .save_state = vfio_save_device_config_state,
> >       .load_setup = vfio_load_setup,
> >       .load_cleanup = vfio_load_cleanup,
> >       .load_state = vfio_load_state,
> >   
>

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

On 2020/11/20 1:41, Alex Williamson wrote:
> On Thu, 19 Nov 2020 14:13:24 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/14/2020 2:47 PM, Shenming Lu wrote:
>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
>>> need to be restored before PCI devices.   
>>
>> Is there any other way by which VGIC can be restored before PCI device?

As far as I know, it seems to have to depend on priorities in the non-iterable process.

>>
>>> Besides, if a VFIO PCI device is
>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
>>> space will trigger the configuring of these VLPIs (in kernel), where it would
>>> return an error as I saw due to the dependency on kvm’s vgic.
>>>   
>>
>> Can this be fixed in kernel to re-initialize the kernel state?

Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
But the fact is that this error is not caused by kernel, it is due to the incorrect
calling order of qemu...

>>
>>> To avoid this, we can move the saving of the config space from the iterable
>>> process to the non-iterable process, so that it will be called after VGIC
>>> according to their priorities.
>>>   
>>
>> With this change, at resume side, pre-copy phase data would reach 
>> destination without restored config space. VFIO device on destination 
>> might need it's config space setup and validated before it can accept 
>> further VFIO device specific migration state.
>>
>> This also changes bit-stream, so it would break migration with original 
>> migration patch-set.
> 
> Config space can continue to change while in pre-copy, if we're only
> sending config space at the initiation of pre-copy, how are any changes
> that might occur before the VM is stopped conveyed to the target?  For
> example the guest might reboot and a device returned to INTx mode from
> MSI during pre-copy.  Thanks,

What I see is that the config space is only saved once in save_live_complete_precopy
currently...
As you said, a VFIO device might need it's config space setup first, and
the config space can continue to change while in pre-copy, Did you mean we
have to migrate the config space in save_live_iterate?
However, I still have a little doubt about the restoring dependence between
the qemu emulated config space and the device data...

Besides, if we surely can't move the saving of the config space back, can we
just move some actions which are triggered by the restoring of the config space
back (such as vfio_msix_enable())?

The demo patch is as follows (but it seems that .save_state is not a appropriate
place to do it. -_-):

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 55261562d4..9cf0310148 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -44,6 +44,7 @@
 #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_STATE_TRIGGER (0xffffffffef100005ULL)

 static int64_t bytes_transferred;

@@ -368,6 +369,15 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
     return qemu_file_get_error(f);
 }

+static void vfio_device_state_trigger(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+
+    if (vbasedev->ops && vbasedev->ops->vfio_trigger_config) {
+        vbasedev->ops->vfio_trigger_config(vbasedev);
+    }
+}
+
 static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -620,6 +630,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     return ret;
 }

+static void vfio_save_state_trigger(QEMUFile *f, void *opaque)
+{
+    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_STATE_TRIGGER);
+
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+}
+
 static int vfio_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
@@ -700,6 +717,18 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             }
             break;
         }
+        case VFIO_MIG_FLAG_DEV_STATE_TRIGGER:
+        {
+            vfio_device_state_trigger(opaque);
+            data = qemu_get_be64(f);
+            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
+                return ret;
+            } else {
+                error_report("%s: STATE TRIGGER: EOS not found 0x%"PRIx64,
+                             vbasedev->name, data);
+                return -EINVAL;
+            }
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;
@@ -720,6 +749,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
     .save_live_pending = vfio_save_pending,
     .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
+    .save_state = vfio_save_state_trigger,
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 58c0ce8971..4cd2feee92 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2441,13 +2441,19 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     vfio_pci_write_config(pdev, PCI_COMMAND,
                           pci_get_word(pdev->config + PCI_COMMAND), 2);

+    return ret;
+}
+
+static void vfio_pci_trigger_config(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    PCIDevice *pdev = &vdev->pdev;
+
     if (msi_enabled(pdev)) {
         vfio_msi_enable(vdev);
     } else if (msix_enabled(pdev)) {
         vfio_msix_enable(vdev);
     }
-
-    return ret;
 }

 static VFIODeviceOps vfio_pci_ops = {
@@ -2457,6 +2463,7 @@ static VFIODeviceOps vfio_pci_ops = {
     .vfio_get_object = vfio_pci_get_object,
     .vfio_save_config = vfio_pci_save_config,
     .vfio_load_config = vfio_pci_load_config,
+    .vfio_trigger_config = vfio_pci_trigger_config,
 };

 int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index baeb4dcff1..4680a3e8d0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -138,6 +138,7 @@ struct VFIODeviceOps {
     Object *(*vfio_get_object)(VFIODevice *vdev);
     void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
     int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
+    void (*vfio_trigger_config)(VFIODevice *vdev);
 };

 typedef struct VFIOGroup {
-- 
2.19.1

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Alex Williamson 5 years, 2 months ago

On Fri, 20 Nov 2020 22:05:49 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> On 2020/11/20 1:41, Alex Williamson wrote:
> > On Thu, 19 Nov 2020 14:13:24 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/14/2020 2:47 PM, Shenming Lu wrote:  
> >>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
> >>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
> >>> need to be restored before PCI devices.     
> >>
> >> Is there any other way by which VGIC can be restored before PCI device?  
> 
> As far as I know, it seems to have to depend on priorities in the non-iterable process.
> 
> >>  
> >>> Besides, if a VFIO PCI device is
> >>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
> >>> space will trigger the configuring of these VLPIs (in kernel), where it would
> >>> return an error as I saw due to the dependency on kvm’s vgic.
> >>>     
> >>
> >> Can this be fixed in kernel to re-initialize the kernel state?  
> 
> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
> But the fact is that this error is not caused by kernel, it is due to the incorrect
> calling order of qemu...
> 
> >>  
> >>> To avoid this, we can move the saving of the config space from the iterable
> >>> process to the non-iterable process, so that it will be called after VGIC
> >>> according to their priorities.
> >>>     
> >>
> >> With this change, at resume side, pre-copy phase data would reach 
> >> destination without restored config space. VFIO device on destination 
> >> might need it's config space setup and validated before it can accept 
> >> further VFIO device specific migration state.
> >>
> >> This also changes bit-stream, so it would break migration with original 
> >> migration patch-set.  
> > 
> > Config space can continue to change while in pre-copy, if we're only
> > sending config space at the initiation of pre-copy, how are any changes
> > that might occur before the VM is stopped conveyed to the target?  For
> > example the guest might reboot and a device returned to INTx mode from
> > MSI during pre-copy.  Thanks,  
> 
> What I see is that the config space is only saved once in save_live_complete_precopy
> currently...
> As you said, a VFIO device might need it's config space setup first, and
> the config space can continue to change while in pre-copy, Did you mean we
> have to migrate the config space in save_live_iterate?
> However, I still have a little doubt about the restoring dependence between
> the qemu emulated config space and the device data...
> 
> Besides, if we surely can't move the saving of the config space back, can we
> just move some actions which are triggered by the restoring of the config space
> back (such as vfio_msix_enable())?

It seems that the significant benefit to enabling interrupts during
pre-copy would be to reduce the latency and failure potential during
the final phase of migration.  Do we have any data for how much it adds
to the device contributed downtime to configure interrupts only at the
final stage?  My guess is that it's a measurable delay on its own.  At
the same time, we can't ignore the differences in machine specific
dependencies and if we don't even sync the config space once the VM is
stopped... this all seems not ready to call supported, especially if we
have concerns already about migration bit-stream compatibility.

Given our timing relative to QEMU 5.2, the only path I feel comfortable
with is to move forward with downgrading vfio migration support to be
enabled via an experimental option.  Objections?  Thanks,

Alex

> 
> The demo patch is as follows (but it seems that .save_state is not a appropriate
> place to do it. -_-):
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 55261562d4..9cf0310148 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -44,6 +44,7 @@
>  #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>  #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>  #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +#define VFIO_MIG_FLAG_DEV_STATE_TRIGGER (0xffffffffef100005ULL)
> 
>  static int64_t bytes_transferred;
> 
> @@ -368,6 +369,15 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
>      return qemu_file_get_error(f);
>  }
> 
> +static void vfio_device_state_trigger(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    if (vbasedev->ops && vbasedev->ops->vfio_trigger_config) {
> +        vbasedev->ops->vfio_trigger_config(vbasedev);
> +    }
> +}
> +
>  static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -620,6 +630,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      return ret;
>  }
> 
> +static void vfio_save_state_trigger(QEMUFile *f, void *opaque)
> +{
> +    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_STATE_TRIGGER);
> +
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +}
> +
>  static int vfio_load_setup(QEMUFile *f, void *opaque)
>  {
>      VFIODevice *vbasedev = opaque;
> @@ -700,6 +717,18 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>              }
>              break;
>          }
> +        case VFIO_MIG_FLAG_DEV_STATE_TRIGGER:
> +        {
> +            vfio_device_state_trigger(opaque);
> +            data = qemu_get_be64(f);
> +            if (data == VFIO_MIG_FLAG_END_OF_STATE) {
> +                return ret;
> +            } else {
> +                error_report("%s: STATE TRIGGER: EOS not found 0x%"PRIx64,
> +                             vbasedev->name, data);
> +                return -EINVAL;
> +            }
> +        }
>          default:
>              error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>              return -EINVAL;
> @@ -720,6 +749,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
>      .save_live_pending = vfio_save_pending,
>      .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
> +    .save_state = vfio_save_state_trigger,
>      .load_setup = vfio_load_setup,
>      .load_cleanup = vfio_load_cleanup,
>      .load_state = vfio_load_state,
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 58c0ce8971..4cd2feee92 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2441,13 +2441,19 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>      vfio_pci_write_config(pdev, PCI_COMMAND,
>                            pci_get_word(pdev->config + PCI_COMMAND), 2);
> 
> +    return ret;
> +}
> +
> +static void vfio_pci_trigger_config(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    PCIDevice *pdev = &vdev->pdev;
> +
>      if (msi_enabled(pdev)) {
>          vfio_msi_enable(vdev);
>      } else if (msix_enabled(pdev)) {
>          vfio_msix_enable(vdev);
>      }
> -
> -    return ret;
>  }
> 
>  static VFIODeviceOps vfio_pci_ops = {
> @@ -2457,6 +2463,7 @@ static VFIODeviceOps vfio_pci_ops = {
>      .vfio_get_object = vfio_pci_get_object,
>      .vfio_save_config = vfio_pci_save_config,
>      .vfio_load_config = vfio_pci_load_config,
> +    .vfio_trigger_config = vfio_pci_trigger_config,
>  };
> 
>  int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index baeb4dcff1..4680a3e8d0 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -138,6 +138,7 @@ struct VFIODeviceOps {
>      Object *(*vfio_get_object)(VFIODevice *vdev);
>      void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f);
>      int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
> +    void (*vfio_trigger_config)(VFIODevice *vdev);
>  };
> 
>  typedef struct VFIOGroup {

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

On 2020/11/21 6:01, Alex Williamson wrote:
> On Fri, 20 Nov 2020 22:05:49 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> On 2020/11/20 1:41, Alex Williamson wrote:
>>> On Thu, 19 Nov 2020 14:13:24 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 11/14/2020 2:47 PM, Shenming Lu wrote:  
>>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
>>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
>>>>> need to be restored before PCI devices.     
>>>>
>>>> Is there any other way by which VGIC can be restored before PCI device?  
>>
>> As far as I know, it seems to have to depend on priorities in the non-iterable process.
>>
>>>>  
>>>>> Besides, if a VFIO PCI device is
>>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
>>>>> space will trigger the configuring of these VLPIs (in kernel), where it would
>>>>> return an error as I saw due to the dependency on kvm’s vgic.
>>>>>     
>>>>
>>>> Can this be fixed in kernel to re-initialize the kernel state?  
>>
>> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
>> But the fact is that this error is not caused by kernel, it is due to the incorrect
>> calling order of qemu...
>>
>>>>  
>>>>> To avoid this, we can move the saving of the config space from the iterable
>>>>> process to the non-iterable process, so that it will be called after VGIC
>>>>> according to their priorities.
>>>>>     
>>>>
>>>> With this change, at resume side, pre-copy phase data would reach 
>>>> destination without restored config space. VFIO device on destination 
>>>> might need it's config space setup and validated before it can accept 
>>>> further VFIO device specific migration state.
>>>>
>>>> This also changes bit-stream, so it would break migration with original 
>>>> migration patch-set.  
>>>
>>> Config space can continue to change while in pre-copy, if we're only
>>> sending config space at the initiation of pre-copy, how are any changes
>>> that might occur before the VM is stopped conveyed to the target?  For
>>> example the guest might reboot and a device returned to INTx mode from
>>> MSI during pre-copy.  Thanks,  
>>
>> What I see is that the config space is only saved once in save_live_complete_precopy
>> currently...
>> As you said, a VFIO device might need it's config space setup first, and
>> the config space can continue to change while in pre-copy, Did you mean we
>> have to migrate the config space in save_live_iterate?
>> However, I still have a little doubt about the restoring dependence between
>> the qemu emulated config space and the device data...
>>
>> Besides, if we surely can't move the saving of the config space back, can we
>> just move some actions which are triggered by the restoring of the config space
>> back (such as vfio_msix_enable())?
> 
> It seems that the significant benefit to enabling interrupts during
> pre-copy would be to reduce the latency and failure potential during
> the final phase of migration.  Do we have any data for how much it adds
> to the device contributed downtime to configure interrupts only at the
> final stage?  My guess is that it's a measurable delay on its own.  At
> the same time, we can't ignore the differences in machine specific
> dependencies and if we don't even sync the config space once the VM is
> stopped... this all seems not ready to call supported, especially if we
> have concerns already about migration bit-stream compatibility.
> 

I have another question for this, if we restore the config space while in pre-copy
(include enabling interrupts), does it affect the _RESUMING state (paused) of the
device on the dst host (cause it to send interrupts? which should not be allowed
in this stage). Does the restore sequence need to be further discussed and reach
a consensus(spec) (taking into account other devices and the corresponding actions
of the vendor driver)?

> Given our timing relative to QEMU 5.2, the only path I feel comfortable
> with is to move forward with downgrading vfio migration support to be
> enabled via an experimental option.  Objections?  Thanks,

Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you
give a rough idea about this (where to enable interrupts, we hope it to be after
the restoring of VGIC)?

Thanks,
Shenming

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Neo Jia 5 years, 2 months ago

On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 2020/11/21 6:01, Alex Williamson wrote:
> > On Fri, 20 Nov 2020 22:05:49 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >
> >> On 2020/11/20 1:41, Alex Williamson wrote:
> >>> On Thu, 19 Nov 2020 14:13:24 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>
> >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote:
> >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
> >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
> >>>>> need to be restored before PCI devices.
> >>>>
> >>>> Is there any other way by which VGIC can be restored before PCI device?
> >>
> >> As far as I know, it seems to have to depend on priorities in the non-iterable process.
> >>
> >>>>
> >>>>> Besides, if a VFIO PCI device is
> >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
> >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would
> >>>>> return an error as I saw due to the dependency on kvm’s vgic.
> >>>>>
> >>>>
> >>>> Can this be fixed in kernel to re-initialize the kernel state?
> >>
> >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
> >> But the fact is that this error is not caused by kernel, it is due to the incorrect
> >> calling order of qemu...
> >>
> >>>>
> >>>>> To avoid this, we can move the saving of the config space from the iterable
> >>>>> process to the non-iterable process, so that it will be called after VGIC
> >>>>> according to their priorities.
> >>>>>
> >>>>
> >>>> With this change, at resume side, pre-copy phase data would reach
> >>>> destination without restored config space. VFIO device on destination
> >>>> might need it's config space setup and validated before it can accept
> >>>> further VFIO device specific migration state.
> >>>>
> >>>> This also changes bit-stream, so it would break migration with original
> >>>> migration patch-set.
> >>>
> >>> Config space can continue to change while in pre-copy, if we're only
> >>> sending config space at the initiation of pre-copy, how are any changes
> >>> that might occur before the VM is stopped conveyed to the target?  For
> >>> example the guest might reboot and a device returned to INTx mode from
> >>> MSI during pre-copy.  Thanks,
> >>
> >> What I see is that the config space is only saved once in save_live_complete_precopy
> >> currently...
> >> As you said, a VFIO device might need it's config space setup first, and
> >> the config space can continue to change while in pre-copy, Did you mean we
> >> have to migrate the config space in save_live_iterate?
> >> However, I still have a little doubt about the restoring dependence between
> >> the qemu emulated config space and the device data...
> >>
> >> Besides, if we surely can't move the saving of the config space back, can we
> >> just move some actions which are triggered by the restoring of the config space
> >> back (such as vfio_msix_enable())?
> >
> > It seems that the significant benefit to enabling interrupts during
> > pre-copy would be to reduce the latency and failure potential during
> > the final phase of migration.  Do we have any data for how much it adds
> > to the device contributed downtime to configure interrupts only at the
> > final stage?  My guess is that it's a measurable delay on its own.  At
> > the same time, we can't ignore the differences in machine specific
> > dependencies and if we don't even sync the config space once the VM is
> > stopped... this all seems not ready to call supported, especially if we
> > have concerns already about migration bit-stream compatibility.
> >
> 
> I have another question for this, if we restore the config space while in pre-copy
> (include enabling interrupts), does it affect the _RESUMING state (paused) of the
> device on the dst host (cause it to send interrupts? which should not be allowed
> in this stage). Does the restore sequence need to be further discussed and reach
> a consensus(spec) (taking into account other devices and the corresponding actions
> of the vendor driver)?
> 
> > Given our timing relative to QEMU 5.2, the only path I feel comfortable
> > with is to move forward with downgrading vfio migration support to be
> > enabled via an experimental option.  Objections?  Thanks,
> 
> Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you
> give a rough idea about this (where to enable interrupts, we hope it to be after
> the restoring of VGIC)?

I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to
make the entire VFIO based migration an experimental feature?

Thanks,
Neo

> 
> Thanks,
> Shenming

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Alex Williamson 5 years, 2 months ago

On Mon, 23 Nov 2020 11:33:37 -0800
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On 2020/11/21 6:01, Alex Williamson wrote:  
> > > On Fri, 20 Nov 2020 22:05:49 +0800
> > > Shenming Lu <lushenming@huawei.com> wrote:
> > >  
> > >> On 2020/11/20 1:41, Alex Williamson wrote:  
> > >>> On Thu, 19 Nov 2020 14:13:24 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>  
> > >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote:  
> > >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
> > >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
> > >>>>> need to be restored before PCI devices.  
> > >>>>
> > >>>> Is there any other way by which VGIC can be restored before PCI device?  
> > >>
> > >> As far as I know, it seems to have to depend on priorities in the non-iterable process.
> > >>  
> > >>>>  
> > >>>>> Besides, if a VFIO PCI device is
> > >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
> > >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would
> > >>>>> return an error as I saw due to the dependency on kvm’s vgic.
> > >>>>>  
> > >>>>
> > >>>> Can this be fixed in kernel to re-initialize the kernel state?  
> > >>
> > >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
> > >> But the fact is that this error is not caused by kernel, it is due to the incorrect
> > >> calling order of qemu...
> > >>  
> > >>>>  
> > >>>>> To avoid this, we can move the saving of the config space from the iterable
> > >>>>> process to the non-iterable process, so that it will be called after VGIC
> > >>>>> according to their priorities.
> > >>>>>  
> > >>>>
> > >>>> With this change, at resume side, pre-copy phase data would reach
> > >>>> destination without restored config space. VFIO device on destination
> > >>>> might need it's config space setup and validated before it can accept
> > >>>> further VFIO device specific migration state.
> > >>>>
> > >>>> This also changes bit-stream, so it would break migration with original
> > >>>> migration patch-set.  
> > >>>
> > >>> Config space can continue to change while in pre-copy, if we're only
> > >>> sending config space at the initiation of pre-copy, how are any changes
> > >>> that might occur before the VM is stopped conveyed to the target?  For
> > >>> example the guest might reboot and a device returned to INTx mode from
> > >>> MSI during pre-copy.  Thanks,  
> > >>
> > >> What I see is that the config space is only saved once in save_live_complete_precopy
> > >> currently...
> > >> As you said, a VFIO device might need it's config space setup first, and
> > >> the config space can continue to change while in pre-copy, Did you mean we
> > >> have to migrate the config space in save_live_iterate?
> > >> However, I still have a little doubt about the restoring dependence between
> > >> the qemu emulated config space and the device data...
> > >>
> > >> Besides, if we surely can't move the saving of the config space back, can we
> > >> just move some actions which are triggered by the restoring of the config space
> > >> back (such as vfio_msix_enable())?  
> > >
> > > It seems that the significant benefit to enabling interrupts during
> > > pre-copy would be to reduce the latency and failure potential during
> > > the final phase of migration.  Do we have any data for how much it adds
> > > to the device contributed downtime to configure interrupts only at the
> > > final stage?  My guess is that it's a measurable delay on its own.  At
> > > the same time, we can't ignore the differences in machine specific
> > > dependencies and if we don't even sync the config space once the VM is
> > > stopped... this all seems not ready to call supported, especially if we
> > > have concerns already about migration bit-stream compatibility.
> > >  
> > 
> > I have another question for this, if we restore the config space while in pre-copy
> > (include enabling interrupts), does it affect the _RESUMING state (paused) of the
> > device on the dst host (cause it to send interrupts? which should not be allowed
> > in this stage). Does the restore sequence need to be further discussed and reach
> > a consensus(spec) (taking into account other devices and the corresponding actions
> > of the vendor driver)?

If a target device generates an interrupt when stopped, I think that
would violate the basic notion of running vs stopped in the device
state, right?

> > > Given our timing relative to QEMU 5.2, the only path I feel comfortable
> > > with is to move forward with downgrading vfio migration support to be
> > > enabled via an experimental option.  Objections?  Thanks,  
> > 
> > Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you
> > give a rough idea about this (where to enable interrupts, we hope it to be after
> > the restoring of VGIC)?  
> 
> I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to
> make the entire VFIO based migration an experimental feature?

Because it's essentially unproven by the larger community.  Let's face
it, there was a surprisingly positive response to my previous proposal
to mark this feature experimental, and mostly it was not due to the
immediate issue that caused me to sound the alarm.  I think this latest
issue shows that this feature has not had the upstream soak time that
it needs.  Had we persisted with its development after it missed QEMU
5.1, had we pushed it upstream after the freeze lifted and iterated
throughout the 5.2 development window, we might be in a better position
right now.

As it stands, we're potentially at the final RC of this release, other
vendors have yet to prove this for their devices and architectures in
the short time since the code entered mainline, we continue to see new
issues, we have zero available vendor drivers supporting migration,
we're not sure how to maintain bit-stream compatibility with the issues
we already know about, and we're out of time.  I don't see how the
upstream community can gamble that this is in a supportable state at
this point.  Thanks,

Alex

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

Hi,

After reading everyone's opinions, we have a rough idea for this issue.

One key point is whether it is necessary to setup the config space before
the device can accept further migration data. I think it is decided by
the vendor driver, so we can simply ask the vendor driver about it in
.save_setup, which could avoid a lot of unnecessary copies and settings.
Once we have known the need, we can iterate the config space (before)
along with the device migration data in .save_live_iterate and
.save_live_complete_precopy, and if not needed, we can only migrate the
config space in .save_state.

Another key point is that the interrupt enabling should be after the
restoring of the interrupt controller (might not only interrupts).
My solution is to add a subflag at the beginning of the config data
(right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
actions on the dst (such as whether to enable interrupts).

Below is it's workflow.

On the save path:
	In vfio_save_setup():
	Ask the vendor driver if it needs the config space setup before it
	can accept further migration data.
		|
	In vfio_save_iterate() (pre-copy):
	If *needed*, save the config space which would be setup on the dst
	before the migration data, but send with a subflag to instruct not
	to (such as) enable interrupts.
		|
	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
	The same as that in vfio_save_iterate().
		|
	In .save_state (stop-and-copy, non-iterable process):
	If *needed*, only send a subflag to instruct to enable interrupts.
	If *not needed*, save the config space and setup everything on the dst.

Besides the above idea, we might be able to choose to let the vendor driver do
more: qemu just sends and writes the config data (before) along with the device
migration data every time, and it's up to the vendor driver to filter out/buffer
the received data or reorder the settings...

Thanks,
Shenming

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Alex Williamson 5 years, 2 months ago

On Thu, 26 Nov 2020 14:56:17 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> Hi,
> 
> After reading everyone's opinions, we have a rough idea for this issue.
> 
> One key point is whether it is necessary to setup the config space before
> the device can accept further migration data. I think it is decided by
> the vendor driver, so we can simply ask the vendor driver about it in
> .save_setup, which could avoid a lot of unnecessary copies and settings.
> Once we have known the need, we can iterate the config space (before)
> along with the device migration data in .save_live_iterate and
> .save_live_complete_precopy, and if not needed, we can only migrate the
> config space in .save_state.
> 
> Another key point is that the interrupt enabling should be after the
> restoring of the interrupt controller (might not only interrupts).
> My solution is to add a subflag at the beginning of the config data
> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
> actions on the dst (such as whether to enable interrupts).
> 
> Below is it's workflow.
> 
> On the save path:
> 	In vfio_save_setup():
> 	Ask the vendor driver if it needs the config space setup before it
> 	can accept further migration data.

How does "ask the vendor driver" actually work?

> 		|
> 	In vfio_save_iterate() (pre-copy):
> 	If *needed*, save the config space which would be setup on the dst
> 	before the migration data, but send with a subflag to instruct not
> 	to (such as) enable interrupts.

If not for triggering things like MSI/X configuration, isn't config
space almost entirely virtual?  What visibility does the vendor driver
have to the VM machine dependencies regarding device interrupt versus
interrupt controller migration?

> 		|
> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
> 	The same as that in vfio_save_iterate().
> 		|
> 	In .save_state (stop-and-copy, non-iterable process):
> 	If *needed*, only send a subflag to instruct to enable interrupts.
> 	If *not needed*, save the config space and setup everything on the dst.

Again, how does the vendor driver have visibility to know when the VM
machine can enable interrupts?

> 
> Besides the above idea, we might be able to choose to let the vendor driver do
> more: qemu just sends and writes the config data (before) along with the device
> migration data every time, and it's up to the vendor driver to filter out/buffer
> the received data or reorder the settings...

There is no vendor driver in QEMU though, so are you suggesting that
QEMU follows a standard protocol and the vendor driver chooses when to
enable specific features?  For instance, QEMU would call SET_IRQS and
the driver would return success, but defer that setup if necessary?
That seems quite troubling as we then have ioctls that behave
differently depending on the device state and we have no error path to
userspace should that setup fail later.  The vendor driver does have
its own data stream for migration, so the vendor driver could tell the
destination version of itself what type of interrupt to use, which
might be sufficient if we were to ignore the latency if QEMU were to
defer interrupt setup until stop-and-copy.

Is the question of when to setup device interrupts versus the interrupt
controller state largely a machine issue within QEMU?  If so, shouldn't
it be at QEMU's determination when to act on the config space
information on the target?  IOW, if a vendor driver has a dependency on
interrupt configuration, they need to include it in their own pre-copy
data stream and decouple that dependency from userspace interrupt
configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,

Alex

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

On 2020/12/1 1:03, Alex Williamson wrote:
> On Thu, 26 Nov 2020 14:56:17 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> Hi,
>>
>> After reading everyone's opinions, we have a rough idea for this issue.
>>
>> One key point is whether it is necessary to setup the config space before
>> the device can accept further migration data. I think it is decided by
>> the vendor driver, so we can simply ask the vendor driver about it in
>> .save_setup, which could avoid a lot of unnecessary copies and settings.
>> Once we have known the need, we can iterate the config space (before)
>> along with the device migration data in .save_live_iterate and
>> .save_live_complete_precopy, and if not needed, we can only migrate the
>> config space in .save_state.
>>
>> Another key point is that the interrupt enabling should be after the
>> restoring of the interrupt controller (might not only interrupts).
>> My solution is to add a subflag at the beginning of the config data
>> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
>> actions on the dst (such as whether to enable interrupts).
>>
>> Below is it's workflow.
>>
>> On the save path:
>> 	In vfio_save_setup():
>> 	Ask the vendor driver if it needs the config space setup before it
>> 	can accept further migration data.
> 
> How does "ask the vendor driver" actually work?

Maybe via a ioctl?
Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup)
and send the config data (before) along with the device migration data
every time?...

> 
>> 		|
>> 	In vfio_save_iterate() (pre-copy):
>> 	If *needed*, save the config space which would be setup on the dst
>> 	before the migration data, but send with a subflag to instruct not
>> 	to (such as) enable interrupts.
> 
> If not for triggering things like MSI/X configuration, isn't config
> space almost entirely virtual?  What visibility does the vendor driver
> have to the VM machine dependencies regarding device interrupt versus
> interrupt controller migration?

My thought is that the vendor driver only decides the order of the config
space setup and the receiving of the migration data, but leaves the VM
machine dependencies to QEMU.

> 
>> 		|
>> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
>> 	The same as that in vfio_save_iterate().
>> 		|
>> 	In .save_state (stop-and-copy, non-iterable process):
>> 	If *needed*, only send a subflag to instruct to enable interrupts.
>> 	If *not needed*, save the config space and setup everything on the dst.
> 
> Again, how does the vendor driver have visibility to know when the VM
> machine can enable interrupts?

It seems troubling if the vendor driver needs the interrupts to be enabled
first...

> 
>>
>> Besides the above idea, we might be able to choose to let the vendor driver do
>> more: qemu just sends and writes the config data (before) along with the device
>> migration data every time, and it's up to the vendor driver to filter out/buffer
>> the received data or reorder the settings...
> 
> There is no vendor driver in QEMU though, so are you suggesting that
> QEMU follows a standard protocol and the vendor driver chooses when to
> enable specific features?  For instance, QEMU would call SET_IRQS and
> the driver would return success, but defer that setup if necessary?
> That seems quite troubling as we then have ioctls that behave
> differently depending on the device state and we have no error path to
> userspace should that setup fail later.  The vendor driver does have
> its own data stream for migration, so the vendor driver could tell the
> destination version of itself what type of interrupt to use, which
> might be sufficient if we were to ignore the latency if QEMU were to
> defer interrupt setup until stop-and-copy.

Did you mean that we could only enable MSI-X during the iterable phase, but
leave the setup of these unmasked vectors to the non-iterable phase?
It looks good to me.

> 
> Is the question of when to setup device interrupts versus the interrupt
> controller state largely a machine issue within QEMU?  If so, shouldn't
> it be at QEMU's determination when to act on the config space
> information on the target?

I think it would be simpler if ensuring the proper calling order in QEMU...

> IOW, if a vendor driver has a dependency on
> interrupt configuration, they need to include it in their own pre-copy
> data stream and decouple that dependency from userspace interrupt
> configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,
> 

I don't understand what the decoupling that dependency from userspace
interrupt configuration means...

Thanks,
Shenming

> Alex
> 
> .
>

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Alex Williamson 5 years, 2 months ago

On Tue, 1 Dec 2020 14:37:52 +0800
Shenming Lu <lushenming@huawei.com> wrote:

> On 2020/12/1 1:03, Alex Williamson wrote:
> > On Thu, 26 Nov 2020 14:56:17 +0800
> > Shenming Lu <lushenming@huawei.com> wrote:
> >   
> >> Hi,
> >>
> >> After reading everyone's opinions, we have a rough idea for this issue.
> >>
> >> One key point is whether it is necessary to setup the config space before
> >> the device can accept further migration data. I think it is decided by
> >> the vendor driver, so we can simply ask the vendor driver about it in
> >> .save_setup, which could avoid a lot of unnecessary copies and settings.
> >> Once we have known the need, we can iterate the config space (before)
> >> along with the device migration data in .save_live_iterate and
> >> .save_live_complete_precopy, and if not needed, we can only migrate the
> >> config space in .save_state.
> >>
> >> Another key point is that the interrupt enabling should be after the
> >> restoring of the interrupt controller (might not only interrupts).
> >> My solution is to add a subflag at the beginning of the config data
> >> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
> >> actions on the dst (such as whether to enable interrupts).
> >>
> >> Below is it's workflow.
> >>
> >> On the save path:
> >> 	In vfio_save_setup():
> >> 	Ask the vendor driver if it needs the config space setup before it
> >> 	can accept further migration data.  
> > 
> > How does "ask the vendor driver" actually work?  
> 
> Maybe via a ioctl?
> Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup)
> and send the config data (before) along with the device migration data
> every time?...

Migration data streams are unidirectional, otherwise we break offline
migration.  There are various ways other than an ioctl for a vendor
driver to advertise requirements, ex. a capability on the migration
region info, but it's not clear to me that we really need this info yet.

> >> 		|
> >> 	In vfio_save_iterate() (pre-copy):
> >> 	If *needed*, save the config space which would be setup on the dst
> >> 	before the migration data, but send with a subflag to instruct not
> >> 	to (such as) enable interrupts.  
> > 
> > If not for triggering things like MSI/X configuration, isn't config
> > space almost entirely virtual?  What visibility does the vendor driver
> > have to the VM machine dependencies regarding device interrupt versus
> > interrupt controller migration?  
> 
> My thought is that the vendor driver only decides the order of the config
> space setup and the receiving of the migration data, but leaves the VM
> machine dependencies to QEMU.

But again, config space is largely virtual, the vendor driver doesn't
have access to it, it's only the effects of config space, like
representing the interrupt mode and configuring it on the device or
specific registers that QEMU writes through that the vendor driver
sees.  So how is it that the vendor driver decides the order?  The
vendor driver doesn't have visibility to the VM machine dependencies,
like when the interrupt controller is sufficiently configured to enable
device interrupts.  It seems that a vendor driver that depends on QEMU
enabling interrupts at a specific point is inherently dependent on
assumptions in the machine configuration.

> >> 		|
> >> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
> >> 	The same as that in vfio_save_iterate().
> >> 		|
> >> 	In .save_state (stop-and-copy, non-iterable process):
> >> 	If *needed*, only send a subflag to instruct to enable interrupts.
> >> 	If *not needed*, save the config space and setup everything on the dst.  
> > 
> > Again, how does the vendor driver have visibility to know when the VM
> > machine can enable interrupts?  
> 
> It seems troubling if the vendor driver needs the interrupts to be enabled
> first...

Yes.

> >> Besides the above idea, we might be able to choose to let the vendor driver do
> >> more: qemu just sends and writes the config data (before) along with the device
> >> migration data every time, and it's up to the vendor driver to filter out/buffer
> >> the received data or reorder the settings...  
> > 
> > There is no vendor driver in QEMU though, so are you suggesting that
> > QEMU follows a standard protocol and the vendor driver chooses when to
> > enable specific features?  For instance, QEMU would call SET_IRQS and
> > the driver would return success, but defer that setup if necessary?
> > That seems quite troubling as we then have ioctls that behave
> > differently depending on the device state and we have no error path to
> > userspace should that setup fail later.  The vendor driver does have
> > its own data stream for migration, so the vendor driver could tell the
> > destination version of itself what type of interrupt to use, which
> > might be sufficient if we were to ignore the latency if QEMU were to
> > defer interrupt setup until stop-and-copy.  
> 
> Did you mean that we could only enable MSI-X during the iterable phase, but
> leave the setup of these unmasked vectors to the non-iterable phase?
> It looks good to me.

On x86 the vfio SET_IRQS ioctl is independent of the VM interrupt
setup, so we could decouple configuring the device interrupts from
plumbing them through the VM.  However, doesn't ARM64 require mapping
an MSI doorbell page via vfio?  Where does is that configured during the
machine load versus the device iterative phase?

> > Is the question of when to setup device interrupts versus the interrupt
> > controller state largely a machine issue within QEMU?  If so, shouldn't
> > it be at QEMU's determination when to act on the config space
> > information on the target?  
> 
> I think it would be simpler if ensuring the proper calling order in QEMU...
> 
> > IOW, if a vendor driver has a dependency on
> > interrupt configuration, they need to include it in their own pre-copy
> > data stream and decouple that dependency from userspace interrupt
> > configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,
> >   
> 
> I don't understand what the decoupling that dependency from userspace
> interrupt configuration means...

I mean that the vendor driver cannot depend on when QEMU calls
SET_IRQS, the vendor specific migration stream contains any information
the vendor driver needs about interrupt configuration such that QEMU
can call SET_IRQS whenever it chooses.  Thanks,

Alex

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

On 2020/12/2 6:21, Alex Williamson wrote:
> On Tue, 1 Dec 2020 14:37:52 +0800
> Shenming Lu <lushenming@huawei.com> wrote:
> 
>> On 2020/12/1 1:03, Alex Williamson wrote:
>>> On Thu, 26 Nov 2020 14:56:17 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>   
>>>> Hi,
>>>>
>>>> After reading everyone's opinions, we have a rough idea for this issue.
>>>>
>>>> One key point is whether it is necessary to setup the config space before
>>>> the device can accept further migration data. I think it is decided by
>>>> the vendor driver, so we can simply ask the vendor driver about it in
>>>> .save_setup, which could avoid a lot of unnecessary copies and settings.
>>>> Once we have known the need, we can iterate the config space (before)
>>>> along with the device migration data in .save_live_iterate and
>>>> .save_live_complete_precopy, and if not needed, we can only migrate the
>>>> config space in .save_state.
>>>>
>>>> Another key point is that the interrupt enabling should be after the
>>>> restoring of the interrupt controller (might not only interrupts).
>>>> My solution is to add a subflag at the beginning of the config data
>>>> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
>>>> actions on the dst (such as whether to enable interrupts).
>>>>
>>>> Below is it's workflow.
>>>>
>>>> On the save path:
>>>> 	In vfio_save_setup():
>>>> 	Ask the vendor driver if it needs the config space setup before it
>>>> 	can accept further migration data.  
>>>
>>> How does "ask the vendor driver" actually work?  
>>
>> Maybe via a ioctl?
>> Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup)
>> and send the config data (before) along with the device migration data
>> every time?...
> 
> Migration data streams are unidirectional, otherwise we break offline
> migration.  There are various ways other than an ioctl for a vendor
> driver to advertise requirements, ex. a capability on the migration
> region info, but it's not clear to me that we really need this info yet.

Ok, this is not clear yet.

> 
>>>> 		|
>>>> 	In vfio_save_iterate() (pre-copy):
>>>> 	If *needed*, save the config space which would be setup on the dst
>>>> 	before the migration data, but send with a subflag to instruct not
>>>> 	to (such as) enable interrupts.  
>>>
>>> If not for triggering things like MSI/X configuration, isn't config
>>> space almost entirely virtual?  What visibility does the vendor driver
>>> have to the VM machine dependencies regarding device interrupt versus
>>> interrupt controller migration?  
>>
>> My thought is that the vendor driver only decides the order of the config
>> space setup and the receiving of the migration data, but leaves the VM
>> machine dependencies to QEMU.
> 
> But again, config space is largely virtual, the vendor driver doesn't
> have access to it, it's only the effects of config space, like
> representing the interrupt mode and configuring it on the device or
> specific registers that QEMU writes through that the vendor driver
> sees.  So how is it that the vendor driver decides the order?

As you said above, a vendor driver could advertise its requirements via
the migration region...

> The
> vendor driver doesn't have visibility to the VM machine dependencies,
> like when the interrupt controller is sufficiently configured to enable
> device interrupts.  It seems that a vendor driver that depends on QEMU
> enabling interrupts at a specific point is inherently dependent on
> assumptions in the machine configuration.

Yeah, there might be more than one dependency.

> 
>>>> 		|
>>>> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
>>>> 	The same as that in vfio_save_iterate().
>>>> 		|
>>>> 	In .save_state (stop-and-copy, non-iterable process):
>>>> 	If *needed*, only send a subflag to instruct to enable interrupts.
>>>> 	If *not needed*, save the config space and setup everything on the dst.  
>>>
>>> Again, how does the vendor driver have visibility to know when the VM
>>> machine can enable interrupts?  
>>
>> It seems troubling if the vendor driver needs the interrupts to be enabled
>> first...
> 
> Yes.
> 
>>>> Besides the above idea, we might be able to choose to let the vendor driver do
>>>> more: qemu just sends and writes the config data (before) along with the device
>>>> migration data every time, and it's up to the vendor driver to filter out/buffer
>>>> the received data or reorder the settings...  
>>>
>>> There is no vendor driver in QEMU though, so are you suggesting that
>>> QEMU follows a standard protocol and the vendor driver chooses when to
>>> enable specific features?  For instance, QEMU would call SET_IRQS and
>>> the driver would return success, but defer that setup if necessary?
>>> That seems quite troubling as we then have ioctls that behave
>>> differently depending on the device state and we have no error path to
>>> userspace should that setup fail later.  The vendor driver does have
>>> its own data stream for migration, so the vendor driver could tell the
>>> destination version of itself what type of interrupt to use, which
>>> might be sufficient if we were to ignore the latency if QEMU were to
>>> defer interrupt setup until stop-and-copy.  
>>
>> Did you mean that we could only enable MSI-X during the iterable phase, but
>> leave the setup of these unmasked vectors to the non-iterable phase?
>> It looks good to me.
> 
> On x86 the vfio SET_IRQS ioctl is independent of the VM interrupt
> setup, so we could decouple configuring the device interrupts from
> plumbing them through the VM.

On ARM64 the setting of the IRQ bypass in the vfio SET_IRQS ioctl is dependent
on the VM interrupt setup...

> However, doesn't ARM64 require mapping
> an MSI doorbell page via vfio?  Where does is that configured during the
> machine load versus the device iterative phase?

Isn't the mapping of the MSI doorbell arch independent in vfio? -_-

> 
>>> Is the question of when to setup device interrupts versus the interrupt
>>> controller state largely a machine issue within QEMU?  If so, shouldn't
>>> it be at QEMU's determination when to act on the config space
>>> information on the target?  
>>
>> I think it would be simpler if ensuring the proper calling order in QEMU...
>>
>>> IOW, if a vendor driver has a dependency on
>>> interrupt configuration, they need to include it in their own pre-copy
>>> data stream and decouple that dependency from userspace interrupt
>>> configuration via the SET_IRQS ioctl.  Is that possible?  Thanks,
>>>   
>>
>> I don't understand what the decoupling that dependency from userspace
>> interrupt configuration means...
> 
> I mean that the vendor driver cannot depend on when QEMU calls
> SET_IRQS, the vendor specific migration stream contains any information
> the vendor driver needs about interrupt configuration such that QEMU
> can call SET_IRQS whenever it chooses.  Thanks,

Oh, that sounds great.
By this way, it seems that we only need to migrate the QEMU config space in
the non-iterable process once.

> 
> Alex
> 
> .
>

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Dr. David Alan Gilbert 5 years, 2 months ago

* Shenming Lu (lushenming@huawei.com) wrote:
> Hi,
> 
> After reading everyone's opinions, we have a rough idea for this issue.
> 
> One key point is whether it is necessary to setup the config space before
> the device can accept further migration data. I think it is decided by
> the vendor driver, so we can simply ask the vendor driver about it in
> .save_setup, which could avoid a lot of unnecessary copies and settings.
> Once we have known the need, we can iterate the config space (before)
> along with the device migration data in .save_live_iterate and
> .save_live_complete_precopy, and if not needed, we can only migrate the
> config space in .save_state.
> 
> Another key point is that the interrupt enabling should be after the
> restoring of the interrupt controller (might not only interrupts).
> My solution is to add a subflag at the beginning of the config data
> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
> actions on the dst (such as whether to enable interrupts).
> 
> Below is it's workflow.
> 
> On the save path:
> 	In vfio_save_setup():
> 	Ask the vendor driver if it needs the config space setup before it
> 	can accept further migration data.

You could argue that you could always send an initial copy of the config
data at the start; and then send new versions at the end anyway.
But is this just about interrupts? Is the dst going to interpret the
received migration data differently depending on the config data?  That
would seem dangerous if the config data was to change during the
migration.

Dave

> 		|
> 	In vfio_save_iterate() (pre-copy):
> 	If *needed*, save the config space which would be setup on the dst
> 	before the migration data, but send with a subflag to instruct not
> 	to (such as) enable interrupts.
> 		|
> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
> 	The same as that in vfio_save_iterate().
> 		|
> 	In .save_state (stop-and-copy, non-iterable process):
> 	If *needed*, only send a subflag to instruct to enable interrupts.
> 	If *not needed*, save the config space and setup everything on the dst.
> 
> Besides the above idea, we might be able to choose to let the vendor driver do
> more: qemu just sends and writes the config data (before) along with the device
> migration data every time, and it's up to the vendor driver to filter out/buffer
> the received data or reorder the settings...
> 
> Thanks,
> Shenming
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago


On 2020/12/2 18:55, Dr. David Alan Gilbert wrote:
> * Shenming Lu (lushenming@huawei.com) wrote:
>> Hi,
>>
>> After reading everyone's opinions, we have a rough idea for this issue.
>>
>> One key point is whether it is necessary to setup the config space before
>> the device can accept further migration data. I think it is decided by
>> the vendor driver, so we can simply ask the vendor driver about it in
>> .save_setup, which could avoid a lot of unnecessary copies and settings.
>> Once we have known the need, we can iterate the config space (before)
>> along with the device migration data in .save_live_iterate and
>> .save_live_complete_precopy, and if not needed, we can only migrate the
>> config space in .save_state.
>>
>> Another key point is that the interrupt enabling should be after the
>> restoring of the interrupt controller (might not only interrupts).
>> My solution is to add a subflag at the beginning of the config data
>> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered
>> actions on the dst (such as whether to enable interrupts).
>>
>> Below is it's workflow.
>>
>> On the save path:
>> 	In vfio_save_setup():
>> 	Ask the vendor driver if it needs the config space setup before it
>> 	can accept further migration data.
> 
> You could argue that you could always send an initial copy of the config
> data at the start; and then send new versions at the end anyway.
> But is this just about interrupts?

In the current implementation, the effect of the config space is mainly about
interrupts (SET_IRQS ioctl).

> Is the dst going to interpret the
> received migration data differently depending on the config data?

I also have a little doubt about the dependency... But not sure.

> That
> would seem dangerous if the config data was to change during the
> migration.
> 
> Dave
> 
>> 		|
>> 	In vfio_save_iterate() (pre-copy):
>> 	If *needed*, save the config space which would be setup on the dst
>> 	before the migration data, but send with a subflag to instruct not
>> 	to (such as) enable interrupts.
>> 		|
>> 	In vfio_save_complete_precopy() (stop-and-copy, iterable process):
>> 	The same as that in vfio_save_iterate().
>> 		|
>> 	In .save_state (stop-and-copy, non-iterable process):
>> 	If *needed*, only send a subflag to instruct to enable interrupts.
>> 	If *not needed*, save the config space and setup everything on the dst.
>>
>> Besides the above idea, we might be able to choose to let the vendor driver do
>> more: qemu just sends and writes the config data (before) along with the device
>> migration data every time, and it's up to the vendor driver to filter out/buffer
>> the received data or reorder the settings...
>>
>> Thanks,
>> Shenming
>>

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Shenming Lu 5 years, 2 months ago

On 2020/11/24 3:33, Neo Jia wrote:
> On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 2020/11/21 6:01, Alex Williamson wrote:
>>> On Fri, 20 Nov 2020 22:05:49 +0800
>>> Shenming Lu <lushenming@huawei.com> wrote:
>>>
>>>> On 2020/11/20 1:41, Alex Williamson wrote:
>>>>> On Thu, 19 Nov 2020 14:13:24 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>
>>>>>> On 11/14/2020 2:47 PM, Shenming Lu wrote:
>>>>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
>>>>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
>>>>>>> need to be restored before PCI devices.
>>>>>>
>>>>>> Is there any other way by which VGIC can be restored before PCI device?
>>>>
>>>> As far as I know, it seems to have to depend on priorities in the non-iterable process.
>>>>
>>>>>>
>>>>>>> Besides, if a VFIO PCI device is
>>>>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
>>>>>>> space will trigger the configuring of these VLPIs (in kernel), where it would
>>>>>>> return an error as I saw due to the dependency on kvm’s vgic.
>>>>>>>
>>>>>>
>>>>>> Can this be fixed in kernel to re-initialize the kernel state?
>>>>
>>>> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
>>>> But the fact is that this error is not caused by kernel, it is due to the incorrect
>>>> calling order of qemu...
>>>>
>>>>>>
>>>>>>> To avoid this, we can move the saving of the config space from the iterable
>>>>>>> process to the non-iterable process, so that it will be called after VGIC
>>>>>>> according to their priorities.
>>>>>>>
>>>>>>
>>>>>> With this change, at resume side, pre-copy phase data would reach
>>>>>> destination without restored config space. VFIO device on destination
>>>>>> might need it's config space setup and validated before it can accept
>>>>>> further VFIO device specific migration state.
>>>>>>
>>>>>> This also changes bit-stream, so it would break migration with original
>>>>>> migration patch-set.
>>>>>
>>>>> Config space can continue to change while in pre-copy, if we're only
>>>>> sending config space at the initiation of pre-copy, how are any changes
>>>>> that might occur before the VM is stopped conveyed to the target?  For
>>>>> example the guest might reboot and a device returned to INTx mode from
>>>>> MSI during pre-copy.  Thanks,
>>>>
>>>> What I see is that the config space is only saved once in save_live_complete_precopy
>>>> currently...
>>>> As you said, a VFIO device might need it's config space setup first, and
>>>> the config space can continue to change while in pre-copy, Did you mean we
>>>> have to migrate the config space in save_live_iterate?
>>>> However, I still have a little doubt about the restoring dependence between
>>>> the qemu emulated config space and the device data...
>>>>
>>>> Besides, if we surely can't move the saving of the config space back, can we
>>>> just move some actions which are triggered by the restoring of the config space
>>>> back (such as vfio_msix_enable())?
>>>
>>> It seems that the significant benefit to enabling interrupts during
>>> pre-copy would be to reduce the latency and failure potential during
>>> the final phase of migration.  Do we have any data for how much it adds
>>> to the device contributed downtime to configure interrupts only at the
>>> final stage?  My guess is that it's a measurable delay on its own.  At
>>> the same time, we can't ignore the differences in machine specific
>>> dependencies and if we don't even sync the config space once the VM is
>>> stopped... this all seems not ready to call supported, especially if we
>>> have concerns already about migration bit-stream compatibility.
>>>
>>
>> I have another question for this, if we restore the config space while in pre-copy
>> (include enabling interrupts), does it affect the _RESUMING state (paused) of the
>> device on the dst host (cause it to send interrupts? which should not be allowed
>> in this stage). Does the restore sequence need to be further discussed and reach
>> a consensus(spec) (taking into account other devices and the corresponding actions
>> of the vendor driver)?
>>
>>> Given our timing relative to QEMU 5.2, the only path I feel comfortable
>>> with is to move forward with downgrading vfio migration support to be
>>> enabled via an experimental option.  Objections?  Thanks,
>>
>> Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you
>> give a rough idea about this (where to enable interrupts, we hope it to be after
>> the restoring of VGIC)?
> 
> I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to
> make the entire VFIO based migration an experimental feature?

It is not specific to Huawei ARM GIC implementation, the error was encountered in general
ARM GIC implementation...

Thanks,
Shenming

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Dr. David Alan Gilbert 5 years, 2 months ago

* Neo Jia (cjia@nvidia.com) wrote:
> On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On 2020/11/21 6:01, Alex Williamson wrote:
> > > On Fri, 20 Nov 2020 22:05:49 +0800
> > > Shenming Lu <lushenming@huawei.com> wrote:
> > >
> > >> On 2020/11/20 1:41, Alex Williamson wrote:
> > >>> On Thu, 19 Nov 2020 14:13:24 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>
> > >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote:
> > >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s
> > >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers
> > >>>>> need to be restored before PCI devices.
> > >>>>
> > >>>> Is there any other way by which VGIC can be restored before PCI device?
> > >>
> > >> As far as I know, it seems to have to depend on priorities in the non-iterable process.
> > >>
> > >>>>
> > >>>>> Besides, if a VFIO PCI device is
> > >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config
> > >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would
> > >>>>> return an error as I saw due to the dependency on kvm’s vgic.
> > >>>>>
> > >>>>
> > >>>> Can this be fixed in kernel to re-initialize the kernel state?
> > >>
> > >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic?
> > >> But the fact is that this error is not caused by kernel, it is due to the incorrect
> > >> calling order of qemu...
> > >>
> > >>>>
> > >>>>> To avoid this, we can move the saving of the config space from the iterable
> > >>>>> process to the non-iterable process, so that it will be called after VGIC
> > >>>>> according to their priorities.
> > >>>>>
> > >>>>
> > >>>> With this change, at resume side, pre-copy phase data would reach
> > >>>> destination without restored config space. VFIO device on destination
> > >>>> might need it's config space setup and validated before it can accept
> > >>>> further VFIO device specific migration state.
> > >>>>
> > >>>> This also changes bit-stream, so it would break migration with original
> > >>>> migration patch-set.
> > >>>
> > >>> Config space can continue to change while in pre-copy, if we're only
> > >>> sending config space at the initiation of pre-copy, how are any changes
> > >>> that might occur before the VM is stopped conveyed to the target?  For
> > >>> example the guest might reboot and a device returned to INTx mode from
> > >>> MSI during pre-copy.  Thanks,
> > >>
> > >> What I see is that the config space is only saved once in save_live_complete_precopy
> > >> currently...
> > >> As you said, a VFIO device might need it's config space setup first, and
> > >> the config space can continue to change while in pre-copy, Did you mean we
> > >> have to migrate the config space in save_live_iterate?
> > >> However, I still have a little doubt about the restoring dependence between
> > >> the qemu emulated config space and the device data...
> > >>
> > >> Besides, if we surely can't move the saving of the config space back, can we
> > >> just move some actions which are triggered by the restoring of the config space
> > >> back (such as vfio_msix_enable())?
> > >
> > > It seems that the significant benefit to enabling interrupts during
> > > pre-copy would be to reduce the latency and failure potential during
> > > the final phase of migration.  Do we have any data for how much it adds
> > > to the device contributed downtime to configure interrupts only at the
> > > final stage?  My guess is that it's a measurable delay on its own.  At
> > > the same time, we can't ignore the differences in machine specific
> > > dependencies and if we don't even sync the config space once the VM is
> > > stopped... this all seems not ready to call supported, especially if we
> > > have concerns already about migration bit-stream compatibility.
> > >
> > 
> > I have another question for this, if we restore the config space while in pre-copy
> > (include enabling interrupts), does it affect the _RESUMING state (paused) of the
> > device on the dst host (cause it to send interrupts? which should not be allowed
> > in this stage). Does the restore sequence need to be further discussed and reach
> > a consensus(spec) (taking into account other devices and the corresponding actions
> > of the vendor driver)?
> > 
> > > Given our timing relative to QEMU 5.2, the only path I feel comfortable
> > > with is to move forward with downgrading vfio migration support to be
> > > enabled via an experimental option.  Objections?  Thanks,
> > 
> > Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you
> > give a rough idea about this (where to enable interrupts, we hope it to be after
> > the restoring of VGIC)?
> 
> I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to
> make the entire VFIO based migration an experimental feature?

It sounds though like the problem is really a general ordering problem;
in normal migration we get the non-iterable device states at the end of
the migration, at the point when the CPUs have already been stopped;
then there's ordering of some of those device states to ensure
that interrupt controllers happen first.   That's not an ARM GIC
specific issue - any interrupt chain can cause it.

As others have said, the other problem here is that other parts of the
state in there can change during hte lifetime of the migration and you
need to make sure you end up with the right config space state on the
destination.

Dave

> Thanks,
> Neo
> 
> > 
> > Thanks,
> > Shenming
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [PATCH RFC] vfio: Move the saving of the config space to the right place in VFIO migration

Posted by Cornelia Huck 5 years, 2 months ago

On Fri, 20 Nov 2020 15:01:46 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> Given our timing relative to QEMU 5.2, the only path I feel comfortable
> with is to move forward with downgrading vfio migration support to be
> enabled via an experimental option.  Objections?  Thanks,
> 
> Alex

Agreed from my side. It seems better to give it one more release to
figure out some issues, and just mark it experimental for now.