When running VFIO migration, I found that the restoring of VFIO PCI device’s
config space is before VGIC on ARM64 target. But generally, interrupt controllers
need to be restored before PCI devices. Besides, if a VFIO PCI device is
configured to have directly-injected MSIs (VLPIs), the restoring of its config
space will trigger the configuring of these VLPIs (in kernel), where it would
return an error as I saw due to the dependency on kvm’s vgic.
To avoid this, we can move the saving of the config space from the iterable
process to the non-iterable process, so that it will be called after VGIC
according to their priorities.
Signed-off-by: Shenming Lu <lushenming@huawei.com>
---
hw/vfio/migration.c | 22 ++++++----------------
1 file changed, 6 insertions(+), 16 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 3ce285ea39..028da35a25 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev)
return 0;
}
-static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
+static void vfio_save_device_config_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
trace_vfio_save_device_config_state(vbasedev->name);
- return qemu_file_get_error(f);
+ if (qemu_file_get_error(f))
+ error_report("%s: Failed to save device config space",
+ vbasedev->name);
}
static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
- uint64_t data;
if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
int ret;
@@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
}
}
- data = qemu_get_be64(f);
- if (data != VFIO_MIG_FLAG_END_OF_STATE) {
- error_report("%s: Failed loading device config space, "
- "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
- return -EINVAL;
- }
-
trace_vfio_load_device_config_state(vbasedev->name);
- return qemu_file_get_error(f);
+ return 0;
}
static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start)
@@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
- ret = vfio_save_device_config_state(f, opaque);
- if (ret) {
- return ret;
- }
-
ret = vfio_update_pending(vbasedev);
if (ret) {
return ret;
@@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = {
.save_live_pending = vfio_save_pending,
.save_live_iterate = vfio_save_iterate,
.save_live_complete_precopy = vfio_save_complete_precopy,
+ .save_state = vfio_save_device_config_state,
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
--
2.19.1
On 11/14/2020 2:47 PM, Shenming Lu wrote: > When running VFIO migration, I found that the restoring of VFIO PCI device’s > config space is before VGIC on ARM64 target. But generally, interrupt controllers > need to be restored before PCI devices. Is there any other way by which VGIC can be restored before PCI device? > Besides, if a VFIO PCI device is > configured to have directly-injected MSIs (VLPIs), the restoring of its config > space will trigger the configuring of these VLPIs (in kernel), where it would > return an error as I saw due to the dependency on kvm’s vgic. > Can this be fixed in kernel to re-initialize the kernel state? > To avoid this, we can move the saving of the config space from the iterable > process to the non-iterable process, so that it will be called after VGIC > according to their priorities. > With this change, at resume side, pre-copy phase data would reach destination without restored config space. VFIO device on destination might need it's config space setup and validated before it can accept further VFIO device specific migration state. This also changes bit-stream, so it would break migration with original migration patch-set. Thanks, Kirti > Signed-off-by: Shenming Lu <lushenming@huawei.com> > --- > hw/vfio/migration.c | 22 ++++++---------------- > 1 file changed, 6 insertions(+), 16 deletions(-) > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 3ce285ea39..028da35a25 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev) > return 0; > } > > -static int vfio_save_device_config_state(QEMUFile *f, void *opaque) > +static void vfio_save_device_config_state(QEMUFile *f, void *opaque) > { > VFIODevice *vbasedev = opaque; > > @@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) > > trace_vfio_save_device_config_state(vbasedev->name); > > - return qemu_file_get_error(f); > + if (qemu_file_get_error(f)) > + error_report("%s: Failed to save device config space", > + vbasedev->name); > } > > static int vfio_load_device_config_state(QEMUFile *f, void *opaque) > { > VFIODevice *vbasedev = opaque; > - uint64_t data; > > if (vbasedev->ops && vbasedev->ops->vfio_load_config) { > int ret; > @@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque) > } > } > > - data = qemu_get_be64(f); > - if (data != VFIO_MIG_FLAG_END_OF_STATE) { > - error_report("%s: Failed loading device config space, " > - "end flag incorrect 0x%"PRIx64, vbasedev->name, data); > - return -EINVAL; > - } > - > trace_vfio_load_device_config_state(vbasedev->name); > - return qemu_file_get_error(f); > + return 0; > } > > static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start) > @@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > return ret; > } > > - ret = vfio_save_device_config_state(f, opaque); > - if (ret) { > - return ret; > - } > - > ret = vfio_update_pending(vbasedev); > if (ret) { > return ret; > @@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = { > .save_live_pending = vfio_save_pending, > .save_live_iterate = vfio_save_iterate, > .save_live_complete_precopy = vfio_save_complete_precopy, > + .save_state = vfio_save_device_config_state, > .load_setup = vfio_load_setup, > .load_cleanup = vfio_load_cleanup, > .load_state = vfio_load_state, >
On Thu, 19 Nov 2020 14:13:24 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote: > On 11/14/2020 2:47 PM, Shenming Lu wrote: > > When running VFIO migration, I found that the restoring of VFIO PCI device’s > > config space is before VGIC on ARM64 target. But generally, interrupt controllers > > need to be restored before PCI devices. > > Is there any other way by which VGIC can be restored before PCI device? > > > Besides, if a VFIO PCI device is > > configured to have directly-injected MSIs (VLPIs), the restoring of its config > > space will trigger the configuring of these VLPIs (in kernel), where it would > > return an error as I saw due to the dependency on kvm’s vgic. > > > > Can this be fixed in kernel to re-initialize the kernel state? > > > To avoid this, we can move the saving of the config space from the iterable > > process to the non-iterable process, so that it will be called after VGIC > > according to their priorities. > > > > With this change, at resume side, pre-copy phase data would reach > destination without restored config space. VFIO device on destination > might need it's config space setup and validated before it can accept > further VFIO device specific migration state. > > This also changes bit-stream, so it would break migration with original > migration patch-set. Config space can continue to change while in pre-copy, if we're only sending config space at the initiation of pre-copy, how are any changes that might occur before the VM is stopped conveyed to the target? For example the guest might reboot and a device returned to INTx mode from MSI during pre-copy. Thanks, Alex > > Signed-off-by: Shenming Lu <lushenming@huawei.com> > > --- > > hw/vfio/migration.c | 22 ++++++---------------- > > 1 file changed, 6 insertions(+), 16 deletions(-) > > > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > > index 3ce285ea39..028da35a25 100644 > > --- a/hw/vfio/migration.c > > +++ b/hw/vfio/migration.c > > @@ -351,7 +351,7 @@ static int vfio_update_pending(VFIODevice *vbasedev) > > return 0; > > } > > > > -static int vfio_save_device_config_state(QEMUFile *f, void *opaque) > > +static void vfio_save_device_config_state(QEMUFile *f, void *opaque) > > { > > VFIODevice *vbasedev = opaque; > > > > @@ -365,13 +365,14 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) > > > > trace_vfio_save_device_config_state(vbasedev->name); > > > > - return qemu_file_get_error(f); > > + if (qemu_file_get_error(f)) > > + error_report("%s: Failed to save device config space", > > + vbasedev->name); > > } > > > > static int vfio_load_device_config_state(QEMUFile *f, void *opaque) > > { > > VFIODevice *vbasedev = opaque; > > - uint64_t data; > > > > if (vbasedev->ops && vbasedev->ops->vfio_load_config) { > > int ret; > > @@ -384,15 +385,8 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque) > > } > > } > > > > - data = qemu_get_be64(f); > > - if (data != VFIO_MIG_FLAG_END_OF_STATE) { > > - error_report("%s: Failed loading device config space, " > > - "end flag incorrect 0x%"PRIx64, vbasedev->name, data); > > - return -EINVAL; > > - } > > - > > trace_vfio_load_device_config_state(vbasedev->name); > > - return qemu_file_get_error(f); > > + return 0; > > } > > > > static int vfio_set_dirty_page_tracking(VFIODevice *vbasedev, bool start) > > @@ -575,11 +569,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > > return ret; > > } > > > > - ret = vfio_save_device_config_state(f, opaque); > > - if (ret) { > > - return ret; > > - } > > - > > ret = vfio_update_pending(vbasedev); > > if (ret) { > > return ret; > > @@ -720,6 +709,7 @@ static SaveVMHandlers savevm_vfio_handlers = { > > .save_live_pending = vfio_save_pending, > > .save_live_iterate = vfio_save_iterate, > > .save_live_complete_precopy = vfio_save_complete_precopy, > > + .save_state = vfio_save_device_config_state, > > .load_setup = vfio_load_setup, > > .load_cleanup = vfio_load_cleanup, > > .load_state = vfio_load_state, > > >
On 2020/11/20 1:41, Alex Williamson wrote: > On Thu, 19 Nov 2020 14:13:24 +0530 > Kirti Wankhede <kwankhede@nvidia.com> wrote: > >> On 11/14/2020 2:47 PM, Shenming Lu wrote: >>> When running VFIO migration, I found that the restoring of VFIO PCI device’s >>> config space is before VGIC on ARM64 target. But generally, interrupt controllers >>> need to be restored before PCI devices. >> >> Is there any other way by which VGIC can be restored before PCI device? As far as I know, it seems to have to depend on priorities in the non-iterable process. >> >>> Besides, if a VFIO PCI device is >>> configured to have directly-injected MSIs (VLPIs), the restoring of its config >>> space will trigger the configuring of these VLPIs (in kernel), where it would >>> return an error as I saw due to the dependency on kvm’s vgic. >>> >> >> Can this be fixed in kernel to re-initialize the kernel state? Did you mean to reconfigure these VLPIs when restoring kvm's vgic? But the fact is that this error is not caused by kernel, it is due to the incorrect calling order of qemu... >> >>> To avoid this, we can move the saving of the config space from the iterable >>> process to the non-iterable process, so that it will be called after VGIC >>> according to their priorities. >>> >> >> With this change, at resume side, pre-copy phase data would reach >> destination without restored config space. VFIO device on destination >> might need it's config space setup and validated before it can accept >> further VFIO device specific migration state. >> >> This also changes bit-stream, so it would break migration with original >> migration patch-set. > > Config space can continue to change while in pre-copy, if we're only > sending config space at the initiation of pre-copy, how are any changes > that might occur before the VM is stopped conveyed to the target? For > example the guest might reboot and a device returned to INTx mode from > MSI during pre-copy. Thanks, What I see is that the config space is only saved once in save_live_complete_precopy currently... As you said, a VFIO device might need it's config space setup first, and the config space can continue to change while in pre-copy, Did you mean we have to migrate the config space in save_live_iterate? However, I still have a little doubt about the restoring dependence between the qemu emulated config space and the device data... Besides, if we surely can't move the saving of the config space back, can we just move some actions which are triggered by the restoring of the config space back (such as vfio_msix_enable())? The demo patch is as follows (but it seems that .save_state is not a appropriate place to do it. -_-): diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 55261562d4..9cf0310148 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -44,6 +44,7 @@ #define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL) #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL) #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL) +#define VFIO_MIG_FLAG_DEV_STATE_TRIGGER (0xffffffffef100005ULL) static int64_t bytes_transferred; @@ -368,6 +369,15 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) return qemu_file_get_error(f); } +static void vfio_device_state_trigger(void *opaque) +{ + VFIODevice *vbasedev = opaque; + + if (vbasedev->ops && vbasedev->ops->vfio_trigger_config) { + vbasedev->ops->vfio_trigger_config(vbasedev); + } +} + static int vfio_load_device_config_state(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -620,6 +630,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) return ret; } +static void vfio_save_state_trigger(QEMUFile *f, void *opaque) +{ + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_STATE_TRIGGER); + + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); +} + static int vfio_load_setup(QEMUFile *f, void *opaque) { VFIODevice *vbasedev = opaque; @@ -700,6 +717,18 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) } break; } + case VFIO_MIG_FLAG_DEV_STATE_TRIGGER: + { + vfio_device_state_trigger(opaque); + data = qemu_get_be64(f); + if (data == VFIO_MIG_FLAG_END_OF_STATE) { + return ret; + } else { + error_report("%s: STATE TRIGGER: EOS not found 0x%"PRIx64, + vbasedev->name, data); + return -EINVAL; + } + } default: error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data); return -EINVAL; @@ -720,6 +749,7 @@ static SaveVMHandlers savevm_vfio_handlers = { .save_live_pending = vfio_save_pending, .save_live_iterate = vfio_save_iterate, .save_live_complete_precopy = vfio_save_complete_precopy, + .save_state = vfio_save_state_trigger, .load_setup = vfio_load_setup, .load_cleanup = vfio_load_cleanup, .load_state = vfio_load_state, diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 58c0ce8971..4cd2feee92 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -2441,13 +2441,19 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) vfio_pci_write_config(pdev, PCI_COMMAND, pci_get_word(pdev->config + PCI_COMMAND), 2); + return ret; +} + +static void vfio_pci_trigger_config(VFIODevice *vbasedev) +{ + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); + PCIDevice *pdev = &vdev->pdev; + if (msi_enabled(pdev)) { vfio_msi_enable(vdev); } else if (msix_enabled(pdev)) { vfio_msix_enable(vdev); } - - return ret; } static VFIODeviceOps vfio_pci_ops = { @@ -2457,6 +2463,7 @@ static VFIODeviceOps vfio_pci_ops = { .vfio_get_object = vfio_pci_get_object, .vfio_save_config = vfio_pci_save_config, .vfio_load_config = vfio_pci_load_config, + .vfio_trigger_config = vfio_pci_trigger_config, }; int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index baeb4dcff1..4680a3e8d0 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -138,6 +138,7 @@ struct VFIODeviceOps { Object *(*vfio_get_object)(VFIODevice *vdev); void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); + void (*vfio_trigger_config)(VFIODevice *vdev); }; typedef struct VFIOGroup { -- 2.19.1
On Fri, 20 Nov 2020 22:05:49 +0800 Shenming Lu <lushenming@huawei.com> wrote: > On 2020/11/20 1:41, Alex Williamson wrote: > > On Thu, 19 Nov 2020 14:13:24 +0530 > > Kirti Wankhede <kwankhede@nvidia.com> wrote: > > > >> On 11/14/2020 2:47 PM, Shenming Lu wrote: > >>> When running VFIO migration, I found that the restoring of VFIO PCI device’s > >>> config space is before VGIC on ARM64 target. But generally, interrupt controllers > >>> need to be restored before PCI devices. > >> > >> Is there any other way by which VGIC can be restored before PCI device? > > As far as I know, it seems to have to depend on priorities in the non-iterable process. > > >> > >>> Besides, if a VFIO PCI device is > >>> configured to have directly-injected MSIs (VLPIs), the restoring of its config > >>> space will trigger the configuring of these VLPIs (in kernel), where it would > >>> return an error as I saw due to the dependency on kvm’s vgic. > >>> > >> > >> Can this be fixed in kernel to re-initialize the kernel state? > > Did you mean to reconfigure these VLPIs when restoring kvm's vgic? > But the fact is that this error is not caused by kernel, it is due to the incorrect > calling order of qemu... > > >> > >>> To avoid this, we can move the saving of the config space from the iterable > >>> process to the non-iterable process, so that it will be called after VGIC > >>> according to their priorities. > >>> > >> > >> With this change, at resume side, pre-copy phase data would reach > >> destination without restored config space. VFIO device on destination > >> might need it's config space setup and validated before it can accept > >> further VFIO device specific migration state. > >> > >> This also changes bit-stream, so it would break migration with original > >> migration patch-set. > > > > Config space can continue to change while in pre-copy, if we're only > > sending config space at the initiation of pre-copy, how are any changes > > that might occur before the VM is stopped conveyed to the target? For > > example the guest might reboot and a device returned to INTx mode from > > MSI during pre-copy. Thanks, > > What I see is that the config space is only saved once in save_live_complete_precopy > currently... > As you said, a VFIO device might need it's config space setup first, and > the config space can continue to change while in pre-copy, Did you mean we > have to migrate the config space in save_live_iterate? > However, I still have a little doubt about the restoring dependence between > the qemu emulated config space and the device data... > > Besides, if we surely can't move the saving of the config space back, can we > just move some actions which are triggered by the restoring of the config space > back (such as vfio_msix_enable())? It seems that the significant benefit to enabling interrupts during pre-copy would be to reduce the latency and failure potential during the final phase of migration. Do we have any data for how much it adds to the device contributed downtime to configure interrupts only at the final stage? My guess is that it's a measurable delay on its own. At the same time, we can't ignore the differences in machine specific dependencies and if we don't even sync the config space once the VM is stopped... this all seems not ready to call supported, especially if we have concerns already about migration bit-stream compatibility. Given our timing relative to QEMU 5.2, the only path I feel comfortable with is to move forward with downgrading vfio migration support to be enabled via an experimental option. Objections? Thanks, Alex > > The demo patch is as follows (but it seems that .save_state is not a appropriate > place to do it. -_-): > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 55261562d4..9cf0310148 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -44,6 +44,7 @@ > #define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL) > #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL) > #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL) > +#define VFIO_MIG_FLAG_DEV_STATE_TRIGGER (0xffffffffef100005ULL) > > static int64_t bytes_transferred; > > @@ -368,6 +369,15 @@ static int vfio_save_device_config_state(QEMUFile *f, void *opaque) > return qemu_file_get_error(f); > } > > +static void vfio_device_state_trigger(void *opaque) > +{ > + VFIODevice *vbasedev = opaque; > + > + if (vbasedev->ops && vbasedev->ops->vfio_trigger_config) { > + vbasedev->ops->vfio_trigger_config(vbasedev); > + } > +} > + > static int vfio_load_device_config_state(QEMUFile *f, void *opaque) > { > VFIODevice *vbasedev = opaque; > @@ -620,6 +630,13 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > return ret; > } > > +static void vfio_save_state_trigger(QEMUFile *f, void *opaque) > +{ > + qemu_put_be64(f, VFIO_MIG_FLAG_DEV_STATE_TRIGGER); > + > + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > +} > + > static int vfio_load_setup(QEMUFile *f, void *opaque) > { > VFIODevice *vbasedev = opaque; > @@ -700,6 +717,18 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) > } > break; > } > + case VFIO_MIG_FLAG_DEV_STATE_TRIGGER: > + { > + vfio_device_state_trigger(opaque); > + data = qemu_get_be64(f); > + if (data == VFIO_MIG_FLAG_END_OF_STATE) { > + return ret; > + } else { > + error_report("%s: STATE TRIGGER: EOS not found 0x%"PRIx64, > + vbasedev->name, data); > + return -EINVAL; > + } > + } > default: > error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data); > return -EINVAL; > @@ -720,6 +749,7 @@ static SaveVMHandlers savevm_vfio_handlers = { > .save_live_pending = vfio_save_pending, > .save_live_iterate = vfio_save_iterate, > .save_live_complete_precopy = vfio_save_complete_precopy, > + .save_state = vfio_save_state_trigger, > .load_setup = vfio_load_setup, > .load_cleanup = vfio_load_cleanup, > .load_state = vfio_load_state, > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index 58c0ce8971..4cd2feee92 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -2441,13 +2441,19 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f) > vfio_pci_write_config(pdev, PCI_COMMAND, > pci_get_word(pdev->config + PCI_COMMAND), 2); > > + return ret; > +} > + > +static void vfio_pci_trigger_config(VFIODevice *vbasedev) > +{ > + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev); > + PCIDevice *pdev = &vdev->pdev; > + > if (msi_enabled(pdev)) { > vfio_msi_enable(vdev); > } else if (msix_enabled(pdev)) { > vfio_msix_enable(vdev); > } > - > - return ret; > } > > static VFIODeviceOps vfio_pci_ops = { > @@ -2457,6 +2463,7 @@ static VFIODeviceOps vfio_pci_ops = { > .vfio_get_object = vfio_pci_get_object, > .vfio_save_config = vfio_pci_save_config, > .vfio_load_config = vfio_pci_load_config, > + .vfio_trigger_config = vfio_pci_trigger_config, > }; > > int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp) > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > index baeb4dcff1..4680a3e8d0 100644 > --- a/include/hw/vfio/vfio-common.h > +++ b/include/hw/vfio/vfio-common.h > @@ -138,6 +138,7 @@ struct VFIODeviceOps { > Object *(*vfio_get_object)(VFIODevice *vdev); > void (*vfio_save_config)(VFIODevice *vdev, QEMUFile *f); > int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f); > + void (*vfio_trigger_config)(VFIODevice *vdev); > }; > > typedef struct VFIOGroup {
On 2020/11/21 6:01, Alex Williamson wrote: > On Fri, 20 Nov 2020 22:05:49 +0800 > Shenming Lu <lushenming@huawei.com> wrote: > >> On 2020/11/20 1:41, Alex Williamson wrote: >>> On Thu, 19 Nov 2020 14:13:24 +0530 >>> Kirti Wankhede <kwankhede@nvidia.com> wrote: >>> >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote: >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers >>>>> need to be restored before PCI devices. >>>> >>>> Is there any other way by which VGIC can be restored before PCI device? >> >> As far as I know, it seems to have to depend on priorities in the non-iterable process. >> >>>> >>>>> Besides, if a VFIO PCI device is >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would >>>>> return an error as I saw due to the dependency on kvm’s vgic. >>>>> >>>> >>>> Can this be fixed in kernel to re-initialize the kernel state? >> >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic? >> But the fact is that this error is not caused by kernel, it is due to the incorrect >> calling order of qemu... >> >>>> >>>>> To avoid this, we can move the saving of the config space from the iterable >>>>> process to the non-iterable process, so that it will be called after VGIC >>>>> according to their priorities. >>>>> >>>> >>>> With this change, at resume side, pre-copy phase data would reach >>>> destination without restored config space. VFIO device on destination >>>> might need it's config space setup and validated before it can accept >>>> further VFIO device specific migration state. >>>> >>>> This also changes bit-stream, so it would break migration with original >>>> migration patch-set. >>> >>> Config space can continue to change while in pre-copy, if we're only >>> sending config space at the initiation of pre-copy, how are any changes >>> that might occur before the VM is stopped conveyed to the target? For >>> example the guest might reboot and a device returned to INTx mode from >>> MSI during pre-copy. Thanks, >> >> What I see is that the config space is only saved once in save_live_complete_precopy >> currently... >> As you said, a VFIO device might need it's config space setup first, and >> the config space can continue to change while in pre-copy, Did you mean we >> have to migrate the config space in save_live_iterate? >> However, I still have a little doubt about the restoring dependence between >> the qemu emulated config space and the device data... >> >> Besides, if we surely can't move the saving of the config space back, can we >> just move some actions which are triggered by the restoring of the config space >> back (such as vfio_msix_enable())? > > It seems that the significant benefit to enabling interrupts during > pre-copy would be to reduce the latency and failure potential during > the final phase of migration. Do we have any data for how much it adds > to the device contributed downtime to configure interrupts only at the > final stage? My guess is that it's a measurable delay on its own. At > the same time, we can't ignore the differences in machine specific > dependencies and if we don't even sync the config space once the VM is > stopped... this all seems not ready to call supported, especially if we > have concerns already about migration bit-stream compatibility. > I have another question for this, if we restore the config space while in pre-copy (include enabling interrupts), does it affect the _RESUMING state (paused) of the device on the dst host (cause it to send interrupts? which should not be allowed in this stage). Does the restore sequence need to be further discussed and reach a consensus(spec) (taking into account other devices and the corresponding actions of the vendor driver)? > Given our timing relative to QEMU 5.2, the only path I feel comfortable > with is to move forward with downgrading vfio migration support to be > enabled via an experimental option. Objections? Thanks, Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you give a rough idea about this (where to enable interrupts, we hope it to be after the restoring of VGIC)? Thanks, Shenming
On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote: > External email: Use caution opening links or attachments > > > On 2020/11/21 6:01, Alex Williamson wrote: > > On Fri, 20 Nov 2020 22:05:49 +0800 > > Shenming Lu <lushenming@huawei.com> wrote: > > > >> On 2020/11/20 1:41, Alex Williamson wrote: > >>> On Thu, 19 Nov 2020 14:13:24 +0530 > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote: > >>> > >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote: > >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s > >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers > >>>>> need to be restored before PCI devices. > >>>> > >>>> Is there any other way by which VGIC can be restored before PCI device? > >> > >> As far as I know, it seems to have to depend on priorities in the non-iterable process. > >> > >>>> > >>>>> Besides, if a VFIO PCI device is > >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config > >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would > >>>>> return an error as I saw due to the dependency on kvm’s vgic. > >>>>> > >>>> > >>>> Can this be fixed in kernel to re-initialize the kernel state? > >> > >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic? > >> But the fact is that this error is not caused by kernel, it is due to the incorrect > >> calling order of qemu... > >> > >>>> > >>>>> To avoid this, we can move the saving of the config space from the iterable > >>>>> process to the non-iterable process, so that it will be called after VGIC > >>>>> according to their priorities. > >>>>> > >>>> > >>>> With this change, at resume side, pre-copy phase data would reach > >>>> destination without restored config space. VFIO device on destination > >>>> might need it's config space setup and validated before it can accept > >>>> further VFIO device specific migration state. > >>>> > >>>> This also changes bit-stream, so it would break migration with original > >>>> migration patch-set. > >>> > >>> Config space can continue to change while in pre-copy, if we're only > >>> sending config space at the initiation of pre-copy, how are any changes > >>> that might occur before the VM is stopped conveyed to the target? For > >>> example the guest might reboot and a device returned to INTx mode from > >>> MSI during pre-copy. Thanks, > >> > >> What I see is that the config space is only saved once in save_live_complete_precopy > >> currently... > >> As you said, a VFIO device might need it's config space setup first, and > >> the config space can continue to change while in pre-copy, Did you mean we > >> have to migrate the config space in save_live_iterate? > >> However, I still have a little doubt about the restoring dependence between > >> the qemu emulated config space and the device data... > >> > >> Besides, if we surely can't move the saving of the config space back, can we > >> just move some actions which are triggered by the restoring of the config space > >> back (such as vfio_msix_enable())? > > > > It seems that the significant benefit to enabling interrupts during > > pre-copy would be to reduce the latency and failure potential during > > the final phase of migration. Do we have any data for how much it adds > > to the device contributed downtime to configure interrupts only at the > > final stage? My guess is that it's a measurable delay on its own. At > > the same time, we can't ignore the differences in machine specific > > dependencies and if we don't even sync the config space once the VM is > > stopped... this all seems not ready to call supported, especially if we > > have concerns already about migration bit-stream compatibility. > > > > I have another question for this, if we restore the config space while in pre-copy > (include enabling interrupts), does it affect the _RESUMING state (paused) of the > device on the dst host (cause it to send interrupts? which should not be allowed > in this stage). Does the restore sequence need to be further discussed and reach > a consensus(spec) (taking into account other devices and the corresponding actions > of the vendor driver)? > > > Given our timing relative to QEMU 5.2, the only path I feel comfortable > > with is to move forward with downgrading vfio migration support to be > > enabled via an experimental option. Objections? Thanks, > > Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you > give a rough idea about this (where to enable interrupts, we hope it to be after > the restoring of VGIC)? I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to make the entire VFIO based migration an experimental feature? Thanks, Neo > > Thanks, > Shenming
On Mon, 23 Nov 2020 11:33:37 -0800 Neo Jia <cjia@nvidia.com> wrote: > On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote: > > External email: Use caution opening links or attachments > > > > > > On 2020/11/21 6:01, Alex Williamson wrote: > > > On Fri, 20 Nov 2020 22:05:49 +0800 > > > Shenming Lu <lushenming@huawei.com> wrote: > > > > > >> On 2020/11/20 1:41, Alex Williamson wrote: > > >>> On Thu, 19 Nov 2020 14:13:24 +0530 > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote: > > >>> > > >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote: > > >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s > > >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers > > >>>>> need to be restored before PCI devices. > > >>>> > > >>>> Is there any other way by which VGIC can be restored before PCI device? > > >> > > >> As far as I know, it seems to have to depend on priorities in the non-iterable process. > > >> > > >>>> > > >>>>> Besides, if a VFIO PCI device is > > >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config > > >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would > > >>>>> return an error as I saw due to the dependency on kvm’s vgic. > > >>>>> > > >>>> > > >>>> Can this be fixed in kernel to re-initialize the kernel state? > > >> > > >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic? > > >> But the fact is that this error is not caused by kernel, it is due to the incorrect > > >> calling order of qemu... > > >> > > >>>> > > >>>>> To avoid this, we can move the saving of the config space from the iterable > > >>>>> process to the non-iterable process, so that it will be called after VGIC > > >>>>> according to their priorities. > > >>>>> > > >>>> > > >>>> With this change, at resume side, pre-copy phase data would reach > > >>>> destination without restored config space. VFIO device on destination > > >>>> might need it's config space setup and validated before it can accept > > >>>> further VFIO device specific migration state. > > >>>> > > >>>> This also changes bit-stream, so it would break migration with original > > >>>> migration patch-set. > > >>> > > >>> Config space can continue to change while in pre-copy, if we're only > > >>> sending config space at the initiation of pre-copy, how are any changes > > >>> that might occur before the VM is stopped conveyed to the target? For > > >>> example the guest might reboot and a device returned to INTx mode from > > >>> MSI during pre-copy. Thanks, > > >> > > >> What I see is that the config space is only saved once in save_live_complete_precopy > > >> currently... > > >> As you said, a VFIO device might need it's config space setup first, and > > >> the config space can continue to change while in pre-copy, Did you mean we > > >> have to migrate the config space in save_live_iterate? > > >> However, I still have a little doubt about the restoring dependence between > > >> the qemu emulated config space and the device data... > > >> > > >> Besides, if we surely can't move the saving of the config space back, can we > > >> just move some actions which are triggered by the restoring of the config space > > >> back (such as vfio_msix_enable())? > > > > > > It seems that the significant benefit to enabling interrupts during > > > pre-copy would be to reduce the latency and failure potential during > > > the final phase of migration. Do we have any data for how much it adds > > > to the device contributed downtime to configure interrupts only at the > > > final stage? My guess is that it's a measurable delay on its own. At > > > the same time, we can't ignore the differences in machine specific > > > dependencies and if we don't even sync the config space once the VM is > > > stopped... this all seems not ready to call supported, especially if we > > > have concerns already about migration bit-stream compatibility. > > > > > > > I have another question for this, if we restore the config space while in pre-copy > > (include enabling interrupts), does it affect the _RESUMING state (paused) of the > > device on the dst host (cause it to send interrupts? which should not be allowed > > in this stage). Does the restore sequence need to be further discussed and reach > > a consensus(spec) (taking into account other devices and the corresponding actions > > of the vendor driver)? If a target device generates an interrupt when stopped, I think that would violate the basic notion of running vs stopped in the device state, right? > > > Given our timing relative to QEMU 5.2, the only path I feel comfortable > > > with is to move forward with downgrading vfio migration support to be > > > enabled via an experimental option. Objections? Thanks, > > > > Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you > > give a rough idea about this (where to enable interrupts, we hope it to be after > > the restoring of VGIC)? > > I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to > make the entire VFIO based migration an experimental feature? Because it's essentially unproven by the larger community. Let's face it, there was a surprisingly positive response to my previous proposal to mark this feature experimental, and mostly it was not due to the immediate issue that caused me to sound the alarm. I think this latest issue shows that this feature has not had the upstream soak time that it needs. Had we persisted with its development after it missed QEMU 5.1, had we pushed it upstream after the freeze lifted and iterated throughout the 5.2 development window, we might be in a better position right now. As it stands, we're potentially at the final RC of this release, other vendors have yet to prove this for their devices and architectures in the short time since the code entered mainline, we continue to see new issues, we have zero available vendor drivers supporting migration, we're not sure how to maintain bit-stream compatibility with the issues we already know about, and we're out of time. I don't see how the upstream community can gamble that this is in a supportable state at this point. Thanks, Alex
Hi, After reading everyone's opinions, we have a rough idea for this issue. One key point is whether it is necessary to setup the config space before the device can accept further migration data. I think it is decided by the vendor driver, so we can simply ask the vendor driver about it in .save_setup, which could avoid a lot of unnecessary copies and settings. Once we have known the need, we can iterate the config space (before) along with the device migration data in .save_live_iterate and .save_live_complete_precopy, and if not needed, we can only migrate the config space in .save_state. Another key point is that the interrupt enabling should be after the restoring of the interrupt controller (might not only interrupts). My solution is to add a subflag at the beginning of the config data (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered actions on the dst (such as whether to enable interrupts). Below is it's workflow. On the save path: In vfio_save_setup(): Ask the vendor driver if it needs the config space setup before it can accept further migration data. | In vfio_save_iterate() (pre-copy): If *needed*, save the config space which would be setup on the dst before the migration data, but send with a subflag to instruct not to (such as) enable interrupts. | In vfio_save_complete_precopy() (stop-and-copy, iterable process): The same as that in vfio_save_iterate(). | In .save_state (stop-and-copy, non-iterable process): If *needed*, only send a subflag to instruct to enable interrupts. If *not needed*, save the config space and setup everything on the dst. Besides the above idea, we might be able to choose to let the vendor driver do more: qemu just sends and writes the config data (before) along with the device migration data every time, and it's up to the vendor driver to filter out/buffer the received data or reorder the settings... Thanks, Shenming
On Thu, 26 Nov 2020 14:56:17 +0800 Shenming Lu <lushenming@huawei.com> wrote: > Hi, > > After reading everyone's opinions, we have a rough idea for this issue. > > One key point is whether it is necessary to setup the config space before > the device can accept further migration data. I think it is decided by > the vendor driver, so we can simply ask the vendor driver about it in > .save_setup, which could avoid a lot of unnecessary copies and settings. > Once we have known the need, we can iterate the config space (before) > along with the device migration data in .save_live_iterate and > .save_live_complete_precopy, and if not needed, we can only migrate the > config space in .save_state. > > Another key point is that the interrupt enabling should be after the > restoring of the interrupt controller (might not only interrupts). > My solution is to add a subflag at the beginning of the config data > (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered > actions on the dst (such as whether to enable interrupts). > > Below is it's workflow. > > On the save path: > In vfio_save_setup(): > Ask the vendor driver if it needs the config space setup before it > can accept further migration data. How does "ask the vendor driver" actually work? > | > In vfio_save_iterate() (pre-copy): > If *needed*, save the config space which would be setup on the dst > before the migration data, but send with a subflag to instruct not > to (such as) enable interrupts. If not for triggering things like MSI/X configuration, isn't config space almost entirely virtual? What visibility does the vendor driver have to the VM machine dependencies regarding device interrupt versus interrupt controller migration? > | > In vfio_save_complete_precopy() (stop-and-copy, iterable process): > The same as that in vfio_save_iterate(). > | > In .save_state (stop-and-copy, non-iterable process): > If *needed*, only send a subflag to instruct to enable interrupts. > If *not needed*, save the config space and setup everything on the dst. Again, how does the vendor driver have visibility to know when the VM machine can enable interrupts? > > Besides the above idea, we might be able to choose to let the vendor driver do > more: qemu just sends and writes the config data (before) along with the device > migration data every time, and it's up to the vendor driver to filter out/buffer > the received data or reorder the settings... There is no vendor driver in QEMU though, so are you suggesting that QEMU follows a standard protocol and the vendor driver chooses when to enable specific features? For instance, QEMU would call SET_IRQS and the driver would return success, but defer that setup if necessary? That seems quite troubling as we then have ioctls that behave differently depending on the device state and we have no error path to userspace should that setup fail later. The vendor driver does have its own data stream for migration, so the vendor driver could tell the destination version of itself what type of interrupt to use, which might be sufficient if we were to ignore the latency if QEMU were to defer interrupt setup until stop-and-copy. Is the question of when to setup device interrupts versus the interrupt controller state largely a machine issue within QEMU? If so, shouldn't it be at QEMU's determination when to act on the config space information on the target? IOW, if a vendor driver has a dependency on interrupt configuration, they need to include it in their own pre-copy data stream and decouple that dependency from userspace interrupt configuration via the SET_IRQS ioctl. Is that possible? Thanks, Alex
On 2020/12/1 1:03, Alex Williamson wrote: > On Thu, 26 Nov 2020 14:56:17 +0800 > Shenming Lu <lushenming@huawei.com> wrote: > >> Hi, >> >> After reading everyone's opinions, we have a rough idea for this issue. >> >> One key point is whether it is necessary to setup the config space before >> the device can accept further migration data. I think it is decided by >> the vendor driver, so we can simply ask the vendor driver about it in >> .save_setup, which could avoid a lot of unnecessary copies and settings. >> Once we have known the need, we can iterate the config space (before) >> along with the device migration data in .save_live_iterate and >> .save_live_complete_precopy, and if not needed, we can only migrate the >> config space in .save_state. >> >> Another key point is that the interrupt enabling should be after the >> restoring of the interrupt controller (might not only interrupts). >> My solution is to add a subflag at the beginning of the config data >> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered >> actions on the dst (such as whether to enable interrupts). >> >> Below is it's workflow. >> >> On the save path: >> In vfio_save_setup(): >> Ask the vendor driver if it needs the config space setup before it >> can accept further migration data. > > How does "ask the vendor driver" actually work? Maybe via a ioctl? Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup) and send the config data (before) along with the device migration data every time?... > >> | >> In vfio_save_iterate() (pre-copy): >> If *needed*, save the config space which would be setup on the dst >> before the migration data, but send with a subflag to instruct not >> to (such as) enable interrupts. > > If not for triggering things like MSI/X configuration, isn't config > space almost entirely virtual? What visibility does the vendor driver > have to the VM machine dependencies regarding device interrupt versus > interrupt controller migration? My thought is that the vendor driver only decides the order of the config space setup and the receiving of the migration data, but leaves the VM machine dependencies to QEMU. > >> | >> In vfio_save_complete_precopy() (stop-and-copy, iterable process): >> The same as that in vfio_save_iterate(). >> | >> In .save_state (stop-and-copy, non-iterable process): >> If *needed*, only send a subflag to instruct to enable interrupts. >> If *not needed*, save the config space and setup everything on the dst. > > Again, how does the vendor driver have visibility to know when the VM > machine can enable interrupts? It seems troubling if the vendor driver needs the interrupts to be enabled first... > >> >> Besides the above idea, we might be able to choose to let the vendor driver do >> more: qemu just sends and writes the config data (before) along with the device >> migration data every time, and it's up to the vendor driver to filter out/buffer >> the received data or reorder the settings... > > There is no vendor driver in QEMU though, so are you suggesting that > QEMU follows a standard protocol and the vendor driver chooses when to > enable specific features? For instance, QEMU would call SET_IRQS and > the driver would return success, but defer that setup if necessary? > That seems quite troubling as we then have ioctls that behave > differently depending on the device state and we have no error path to > userspace should that setup fail later. The vendor driver does have > its own data stream for migration, so the vendor driver could tell the > destination version of itself what type of interrupt to use, which > might be sufficient if we were to ignore the latency if QEMU were to > defer interrupt setup until stop-and-copy. Did you mean that we could only enable MSI-X during the iterable phase, but leave the setup of these unmasked vectors to the non-iterable phase? It looks good to me. > > Is the question of when to setup device interrupts versus the interrupt > controller state largely a machine issue within QEMU? If so, shouldn't > it be at QEMU's determination when to act on the config space > information on the target? I think it would be simpler if ensuring the proper calling order in QEMU... > IOW, if a vendor driver has a dependency on > interrupt configuration, they need to include it in their own pre-copy > data stream and decouple that dependency from userspace interrupt > configuration via the SET_IRQS ioctl. Is that possible? Thanks, > I don't understand what the decoupling that dependency from userspace interrupt configuration means... Thanks, Shenming > Alex > > . >
On Tue, 1 Dec 2020 14:37:52 +0800 Shenming Lu <lushenming@huawei.com> wrote: > On 2020/12/1 1:03, Alex Williamson wrote: > > On Thu, 26 Nov 2020 14:56:17 +0800 > > Shenming Lu <lushenming@huawei.com> wrote: > > > >> Hi, > >> > >> After reading everyone's opinions, we have a rough idea for this issue. > >> > >> One key point is whether it is necessary to setup the config space before > >> the device can accept further migration data. I think it is decided by > >> the vendor driver, so we can simply ask the vendor driver about it in > >> .save_setup, which could avoid a lot of unnecessary copies and settings. > >> Once we have known the need, we can iterate the config space (before) > >> along with the device migration data in .save_live_iterate and > >> .save_live_complete_precopy, and if not needed, we can only migrate the > >> config space in .save_state. > >> > >> Another key point is that the interrupt enabling should be after the > >> restoring of the interrupt controller (might not only interrupts). > >> My solution is to add a subflag at the beginning of the config data > >> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered > >> actions on the dst (such as whether to enable interrupts). > >> > >> Below is it's workflow. > >> > >> On the save path: > >> In vfio_save_setup(): > >> Ask the vendor driver if it needs the config space setup before it > >> can accept further migration data. > > > > How does "ask the vendor driver" actually work? > > Maybe via a ioctl? > Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup) > and send the config data (before) along with the device migration data > every time?... Migration data streams are unidirectional, otherwise we break offline migration. There are various ways other than an ioctl for a vendor driver to advertise requirements, ex. a capability on the migration region info, but it's not clear to me that we really need this info yet. > >> | > >> In vfio_save_iterate() (pre-copy): > >> If *needed*, save the config space which would be setup on the dst > >> before the migration data, but send with a subflag to instruct not > >> to (such as) enable interrupts. > > > > If not for triggering things like MSI/X configuration, isn't config > > space almost entirely virtual? What visibility does the vendor driver > > have to the VM machine dependencies regarding device interrupt versus > > interrupt controller migration? > > My thought is that the vendor driver only decides the order of the config > space setup and the receiving of the migration data, but leaves the VM > machine dependencies to QEMU. But again, config space is largely virtual, the vendor driver doesn't have access to it, it's only the effects of config space, like representing the interrupt mode and configuring it on the device or specific registers that QEMU writes through that the vendor driver sees. So how is it that the vendor driver decides the order? The vendor driver doesn't have visibility to the VM machine dependencies, like when the interrupt controller is sufficiently configured to enable device interrupts. It seems that a vendor driver that depends on QEMU enabling interrupts at a specific point is inherently dependent on assumptions in the machine configuration. > >> | > >> In vfio_save_complete_precopy() (stop-and-copy, iterable process): > >> The same as that in vfio_save_iterate(). > >> | > >> In .save_state (stop-and-copy, non-iterable process): > >> If *needed*, only send a subflag to instruct to enable interrupts. > >> If *not needed*, save the config space and setup everything on the dst. > > > > Again, how does the vendor driver have visibility to know when the VM > > machine can enable interrupts? > > It seems troubling if the vendor driver needs the interrupts to be enabled > first... Yes. > >> Besides the above idea, we might be able to choose to let the vendor driver do > >> more: qemu just sends and writes the config data (before) along with the device > >> migration data every time, and it's up to the vendor driver to filter out/buffer > >> the received data or reorder the settings... > > > > There is no vendor driver in QEMU though, so are you suggesting that > > QEMU follows a standard protocol and the vendor driver chooses when to > > enable specific features? For instance, QEMU would call SET_IRQS and > > the driver would return success, but defer that setup if necessary? > > That seems quite troubling as we then have ioctls that behave > > differently depending on the device state and we have no error path to > > userspace should that setup fail later. The vendor driver does have > > its own data stream for migration, so the vendor driver could tell the > > destination version of itself what type of interrupt to use, which > > might be sufficient if we were to ignore the latency if QEMU were to > > defer interrupt setup until stop-and-copy. > > Did you mean that we could only enable MSI-X during the iterable phase, but > leave the setup of these unmasked vectors to the non-iterable phase? > It looks good to me. On x86 the vfio SET_IRQS ioctl is independent of the VM interrupt setup, so we could decouple configuring the device interrupts from plumbing them through the VM. However, doesn't ARM64 require mapping an MSI doorbell page via vfio? Where does is that configured during the machine load versus the device iterative phase? > > Is the question of when to setup device interrupts versus the interrupt > > controller state largely a machine issue within QEMU? If so, shouldn't > > it be at QEMU's determination when to act on the config space > > information on the target? > > I think it would be simpler if ensuring the proper calling order in QEMU... > > > IOW, if a vendor driver has a dependency on > > interrupt configuration, they need to include it in their own pre-copy > > data stream and decouple that dependency from userspace interrupt > > configuration via the SET_IRQS ioctl. Is that possible? Thanks, > > > > I don't understand what the decoupling that dependency from userspace > interrupt configuration means... I mean that the vendor driver cannot depend on when QEMU calls SET_IRQS, the vendor specific migration stream contains any information the vendor driver needs about interrupt configuration such that QEMU can call SET_IRQS whenever it chooses. Thanks, Alex
On 2020/12/2 6:21, Alex Williamson wrote: > On Tue, 1 Dec 2020 14:37:52 +0800 > Shenming Lu <lushenming@huawei.com> wrote: > >> On 2020/12/1 1:03, Alex Williamson wrote: >>> On Thu, 26 Nov 2020 14:56:17 +0800 >>> Shenming Lu <lushenming@huawei.com> wrote: >>> >>>> Hi, >>>> >>>> After reading everyone's opinions, we have a rough idea for this issue. >>>> >>>> One key point is whether it is necessary to setup the config space before >>>> the device can accept further migration data. I think it is decided by >>>> the vendor driver, so we can simply ask the vendor driver about it in >>>> .save_setup, which could avoid a lot of unnecessary copies and settings. >>>> Once we have known the need, we can iterate the config space (before) >>>> along with the device migration data in .save_live_iterate and >>>> .save_live_complete_precopy, and if not needed, we can only migrate the >>>> config space in .save_state. >>>> >>>> Another key point is that the interrupt enabling should be after the >>>> restoring of the interrupt controller (might not only interrupts). >>>> My solution is to add a subflag at the beginning of the config data >>>> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered >>>> actions on the dst (such as whether to enable interrupts). >>>> >>>> Below is it's workflow. >>>> >>>> On the save path: >>>> In vfio_save_setup(): >>>> Ask the vendor driver if it needs the config space setup before it >>>> can accept further migration data. >>> >>> How does "ask the vendor driver" actually work? >> >> Maybe via a ioctl? >> Oh, it seems that we have to ask the dst vendor driver (in vfio_load_setup) >> and send the config data (before) along with the device migration data >> every time?... > > Migration data streams are unidirectional, otherwise we break offline > migration. There are various ways other than an ioctl for a vendor > driver to advertise requirements, ex. a capability on the migration > region info, but it's not clear to me that we really need this info yet. Ok, this is not clear yet. > >>>> | >>>> In vfio_save_iterate() (pre-copy): >>>> If *needed*, save the config space which would be setup on the dst >>>> before the migration data, but send with a subflag to instruct not >>>> to (such as) enable interrupts. >>> >>> If not for triggering things like MSI/X configuration, isn't config >>> space almost entirely virtual? What visibility does the vendor driver >>> have to the VM machine dependencies regarding device interrupt versus >>> interrupt controller migration? >> >> My thought is that the vendor driver only decides the order of the config >> space setup and the receiving of the migration data, but leaves the VM >> machine dependencies to QEMU. > > But again, config space is largely virtual, the vendor driver doesn't > have access to it, it's only the effects of config space, like > representing the interrupt mode and configuring it on the device or > specific registers that QEMU writes through that the vendor driver > sees. So how is it that the vendor driver decides the order? As you said above, a vendor driver could advertise its requirements via the migration region... > The > vendor driver doesn't have visibility to the VM machine dependencies, > like when the interrupt controller is sufficiently configured to enable > device interrupts. It seems that a vendor driver that depends on QEMU > enabling interrupts at a specific point is inherently dependent on > assumptions in the machine configuration. Yeah, there might be more than one dependency. > >>>> | >>>> In vfio_save_complete_precopy() (stop-and-copy, iterable process): >>>> The same as that in vfio_save_iterate(). >>>> | >>>> In .save_state (stop-and-copy, non-iterable process): >>>> If *needed*, only send a subflag to instruct to enable interrupts. >>>> If *not needed*, save the config space and setup everything on the dst. >>> >>> Again, how does the vendor driver have visibility to know when the VM >>> machine can enable interrupts? >> >> It seems troubling if the vendor driver needs the interrupts to be enabled >> first... > > Yes. > >>>> Besides the above idea, we might be able to choose to let the vendor driver do >>>> more: qemu just sends and writes the config data (before) along with the device >>>> migration data every time, and it's up to the vendor driver to filter out/buffer >>>> the received data or reorder the settings... >>> >>> There is no vendor driver in QEMU though, so are you suggesting that >>> QEMU follows a standard protocol and the vendor driver chooses when to >>> enable specific features? For instance, QEMU would call SET_IRQS and >>> the driver would return success, but defer that setup if necessary? >>> That seems quite troubling as we then have ioctls that behave >>> differently depending on the device state and we have no error path to >>> userspace should that setup fail later. The vendor driver does have >>> its own data stream for migration, so the vendor driver could tell the >>> destination version of itself what type of interrupt to use, which >>> might be sufficient if we were to ignore the latency if QEMU were to >>> defer interrupt setup until stop-and-copy. >> >> Did you mean that we could only enable MSI-X during the iterable phase, but >> leave the setup of these unmasked vectors to the non-iterable phase? >> It looks good to me. > > On x86 the vfio SET_IRQS ioctl is independent of the VM interrupt > setup, so we could decouple configuring the device interrupts from > plumbing them through the VM. On ARM64 the setting of the IRQ bypass in the vfio SET_IRQS ioctl is dependent on the VM interrupt setup... > However, doesn't ARM64 require mapping > an MSI doorbell page via vfio? Where does is that configured during the > machine load versus the device iterative phase? Isn't the mapping of the MSI doorbell arch independent in vfio? -_- > >>> Is the question of when to setup device interrupts versus the interrupt >>> controller state largely a machine issue within QEMU? If so, shouldn't >>> it be at QEMU's determination when to act on the config space >>> information on the target? >> >> I think it would be simpler if ensuring the proper calling order in QEMU... >> >>> IOW, if a vendor driver has a dependency on >>> interrupt configuration, they need to include it in their own pre-copy >>> data stream and decouple that dependency from userspace interrupt >>> configuration via the SET_IRQS ioctl. Is that possible? Thanks, >>> >> >> I don't understand what the decoupling that dependency from userspace >> interrupt configuration means... > > I mean that the vendor driver cannot depend on when QEMU calls > SET_IRQS, the vendor specific migration stream contains any information > the vendor driver needs about interrupt configuration such that QEMU > can call SET_IRQS whenever it chooses. Thanks, Oh, that sounds great. By this way, it seems that we only need to migrate the QEMU config space in the non-iterable process once. > > Alex > > . >
* Shenming Lu (lushenming@huawei.com) wrote: > Hi, > > After reading everyone's opinions, we have a rough idea for this issue. > > One key point is whether it is necessary to setup the config space before > the device can accept further migration data. I think it is decided by > the vendor driver, so we can simply ask the vendor driver about it in > .save_setup, which could avoid a lot of unnecessary copies and settings. > Once we have known the need, we can iterate the config space (before) > along with the device migration data in .save_live_iterate and > .save_live_complete_precopy, and if not needed, we can only migrate the > config space in .save_state. > > Another key point is that the interrupt enabling should be after the > restoring of the interrupt controller (might not only interrupts). > My solution is to add a subflag at the beginning of the config data > (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered > actions on the dst (such as whether to enable interrupts). > > Below is it's workflow. > > On the save path: > In vfio_save_setup(): > Ask the vendor driver if it needs the config space setup before it > can accept further migration data. You could argue that you could always send an initial copy of the config data at the start; and then send new versions at the end anyway. But is this just about interrupts? Is the dst going to interpret the received migration data differently depending on the config data? That would seem dangerous if the config data was to change during the migration. Dave > | > In vfio_save_iterate() (pre-copy): > If *needed*, save the config space which would be setup on the dst > before the migration data, but send with a subflag to instruct not > to (such as) enable interrupts. > | > In vfio_save_complete_precopy() (stop-and-copy, iterable process): > The same as that in vfio_save_iterate(). > | > In .save_state (stop-and-copy, non-iterable process): > If *needed*, only send a subflag to instruct to enable interrupts. > If *not needed*, save the config space and setup everything on the dst. > > Besides the above idea, we might be able to choose to let the vendor driver do > more: qemu just sends and writes the config data (before) along with the device > migration data every time, and it's up to the vendor driver to filter out/buffer > the received data or reorder the settings... > > Thanks, > Shenming > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 2020/12/2 18:55, Dr. David Alan Gilbert wrote: > * Shenming Lu (lushenming@huawei.com) wrote: >> Hi, >> >> After reading everyone's opinions, we have a rough idea for this issue. >> >> One key point is whether it is necessary to setup the config space before >> the device can accept further migration data. I think it is decided by >> the vendor driver, so we can simply ask the vendor driver about it in >> .save_setup, which could avoid a lot of unnecessary copies and settings. >> Once we have known the need, we can iterate the config space (before) >> along with the device migration data in .save_live_iterate and >> .save_live_complete_precopy, and if not needed, we can only migrate the >> config space in .save_state. >> >> Another key point is that the interrupt enabling should be after the >> restoring of the interrupt controller (might not only interrupts). >> My solution is to add a subflag at the beginning of the config data >> (right after VFIO_MIG_FLAG_DEV_CONFIG_STATE) to indicate the triggered >> actions on the dst (such as whether to enable interrupts). >> >> Below is it's workflow. >> >> On the save path: >> In vfio_save_setup(): >> Ask the vendor driver if it needs the config space setup before it >> can accept further migration data. > > You could argue that you could always send an initial copy of the config > data at the start; and then send new versions at the end anyway. > But is this just about interrupts? In the current implementation, the effect of the config space is mainly about interrupts (SET_IRQS ioctl). > Is the dst going to interpret the > received migration data differently depending on the config data? I also have a little doubt about the dependency... But not sure. > That > would seem dangerous if the config data was to change during the > migration. > > Dave > >> | >> In vfio_save_iterate() (pre-copy): >> If *needed*, save the config space which would be setup on the dst >> before the migration data, but send with a subflag to instruct not >> to (such as) enable interrupts. >> | >> In vfio_save_complete_precopy() (stop-and-copy, iterable process): >> The same as that in vfio_save_iterate(). >> | >> In .save_state (stop-and-copy, non-iterable process): >> If *needed*, only send a subflag to instruct to enable interrupts. >> If *not needed*, save the config space and setup everything on the dst. >> >> Besides the above idea, we might be able to choose to let the vendor driver do >> more: qemu just sends and writes the config data (before) along with the device >> migration data every time, and it's up to the vendor driver to filter out/buffer >> the received data or reorder the settings... >> >> Thanks, >> Shenming >>
On 2020/11/24 3:33, Neo Jia wrote: > On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote: >> External email: Use caution opening links or attachments >> >> >> On 2020/11/21 6:01, Alex Williamson wrote: >>> On Fri, 20 Nov 2020 22:05:49 +0800 >>> Shenming Lu <lushenming@huawei.com> wrote: >>> >>>> On 2020/11/20 1:41, Alex Williamson wrote: >>>>> On Thu, 19 Nov 2020 14:13:24 +0530 >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote: >>>>> >>>>>> On 11/14/2020 2:47 PM, Shenming Lu wrote: >>>>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s >>>>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers >>>>>>> need to be restored before PCI devices. >>>>>> >>>>>> Is there any other way by which VGIC can be restored before PCI device? >>>> >>>> As far as I know, it seems to have to depend on priorities in the non-iterable process. >>>> >>>>>> >>>>>>> Besides, if a VFIO PCI device is >>>>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config >>>>>>> space will trigger the configuring of these VLPIs (in kernel), where it would >>>>>>> return an error as I saw due to the dependency on kvm’s vgic. >>>>>>> >>>>>> >>>>>> Can this be fixed in kernel to re-initialize the kernel state? >>>> >>>> Did you mean to reconfigure these VLPIs when restoring kvm's vgic? >>>> But the fact is that this error is not caused by kernel, it is due to the incorrect >>>> calling order of qemu... >>>> >>>>>> >>>>>>> To avoid this, we can move the saving of the config space from the iterable >>>>>>> process to the non-iterable process, so that it will be called after VGIC >>>>>>> according to their priorities. >>>>>>> >>>>>> >>>>>> With this change, at resume side, pre-copy phase data would reach >>>>>> destination without restored config space. VFIO device on destination >>>>>> might need it's config space setup and validated before it can accept >>>>>> further VFIO device specific migration state. >>>>>> >>>>>> This also changes bit-stream, so it would break migration with original >>>>>> migration patch-set. >>>>> >>>>> Config space can continue to change while in pre-copy, if we're only >>>>> sending config space at the initiation of pre-copy, how are any changes >>>>> that might occur before the VM is stopped conveyed to the target? For >>>>> example the guest might reboot and a device returned to INTx mode from >>>>> MSI during pre-copy. Thanks, >>>> >>>> What I see is that the config space is only saved once in save_live_complete_precopy >>>> currently... >>>> As you said, a VFIO device might need it's config space setup first, and >>>> the config space can continue to change while in pre-copy, Did you mean we >>>> have to migrate the config space in save_live_iterate? >>>> However, I still have a little doubt about the restoring dependence between >>>> the qemu emulated config space and the device data... >>>> >>>> Besides, if we surely can't move the saving of the config space back, can we >>>> just move some actions which are triggered by the restoring of the config space >>>> back (such as vfio_msix_enable())? >>> >>> It seems that the significant benefit to enabling interrupts during >>> pre-copy would be to reduce the latency and failure potential during >>> the final phase of migration. Do we have any data for how much it adds >>> to the device contributed downtime to configure interrupts only at the >>> final stage? My guess is that it's a measurable delay on its own. At >>> the same time, we can't ignore the differences in machine specific >>> dependencies and if we don't even sync the config space once the VM is >>> stopped... this all seems not ready to call supported, especially if we >>> have concerns already about migration bit-stream compatibility. >>> >> >> I have another question for this, if we restore the config space while in pre-copy >> (include enabling interrupts), does it affect the _RESUMING state (paused) of the >> device on the dst host (cause it to send interrupts? which should not be allowed >> in this stage). Does the restore sequence need to be further discussed and reach >> a consensus(spec) (taking into account other devices and the corresponding actions >> of the vendor driver)? >> >>> Given our timing relative to QEMU 5.2, the only path I feel comfortable >>> with is to move forward with downgrading vfio migration support to be >>> enabled via an experimental option. Objections? Thanks, >> >> Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you >> give a rough idea about this (where to enable interrupts, we hope it to be after >> the restoring of VGIC)? > > I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to > make the entire VFIO based migration an experimental feature? It is not specific to Huawei ARM GIC implementation, the error was encountered in general ARM GIC implementation... Thanks, Shenming
* Neo Jia (cjia@nvidia.com) wrote: > On Mon, Nov 23, 2020 at 11:14:38AM +0800, Shenming Lu wrote: > > External email: Use caution opening links or attachments > > > > > > On 2020/11/21 6:01, Alex Williamson wrote: > > > On Fri, 20 Nov 2020 22:05:49 +0800 > > > Shenming Lu <lushenming@huawei.com> wrote: > > > > > >> On 2020/11/20 1:41, Alex Williamson wrote: > > >>> On Thu, 19 Nov 2020 14:13:24 +0530 > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote: > > >>> > > >>>> On 11/14/2020 2:47 PM, Shenming Lu wrote: > > >>>>> When running VFIO migration, I found that the restoring of VFIO PCI device’s > > >>>>> config space is before VGIC on ARM64 target. But generally, interrupt controllers > > >>>>> need to be restored before PCI devices. > > >>>> > > >>>> Is there any other way by which VGIC can be restored before PCI device? > > >> > > >> As far as I know, it seems to have to depend on priorities in the non-iterable process. > > >> > > >>>> > > >>>>> Besides, if a VFIO PCI device is > > >>>>> configured to have directly-injected MSIs (VLPIs), the restoring of its config > > >>>>> space will trigger the configuring of these VLPIs (in kernel), where it would > > >>>>> return an error as I saw due to the dependency on kvm’s vgic. > > >>>>> > > >>>> > > >>>> Can this be fixed in kernel to re-initialize the kernel state? > > >> > > >> Did you mean to reconfigure these VLPIs when restoring kvm's vgic? > > >> But the fact is that this error is not caused by kernel, it is due to the incorrect > > >> calling order of qemu... > > >> > > >>>> > > >>>>> To avoid this, we can move the saving of the config space from the iterable > > >>>>> process to the non-iterable process, so that it will be called after VGIC > > >>>>> according to their priorities. > > >>>>> > > >>>> > > >>>> With this change, at resume side, pre-copy phase data would reach > > >>>> destination without restored config space. VFIO device on destination > > >>>> might need it's config space setup and validated before it can accept > > >>>> further VFIO device specific migration state. > > >>>> > > >>>> This also changes bit-stream, so it would break migration with original > > >>>> migration patch-set. > > >>> > > >>> Config space can continue to change while in pre-copy, if we're only > > >>> sending config space at the initiation of pre-copy, how are any changes > > >>> that might occur before the VM is stopped conveyed to the target? For > > >>> example the guest might reboot and a device returned to INTx mode from > > >>> MSI during pre-copy. Thanks, > > >> > > >> What I see is that the config space is only saved once in save_live_complete_precopy > > >> currently... > > >> As you said, a VFIO device might need it's config space setup first, and > > >> the config space can continue to change while in pre-copy, Did you mean we > > >> have to migrate the config space in save_live_iterate? > > >> However, I still have a little doubt about the restoring dependence between > > >> the qemu emulated config space and the device data... > > >> > > >> Besides, if we surely can't move the saving of the config space back, can we > > >> just move some actions which are triggered by the restoring of the config space > > >> back (such as vfio_msix_enable())? > > > > > > It seems that the significant benefit to enabling interrupts during > > > pre-copy would be to reduce the latency and failure potential during > > > the final phase of migration. Do we have any data for how much it adds > > > to the device contributed downtime to configure interrupts only at the > > > final stage? My guess is that it's a measurable delay on its own. At > > > the same time, we can't ignore the differences in machine specific > > > dependencies and if we don't even sync the config space once the VM is > > > stopped... this all seems not ready to call supported, especially if we > > > have concerns already about migration bit-stream compatibility. > > > > > > > I have another question for this, if we restore the config space while in pre-copy > > (include enabling interrupts), does it affect the _RESUMING state (paused) of the > > device on the dst host (cause it to send interrupts? which should not be allowed > > in this stage). Does the restore sequence need to be further discussed and reach > > a consensus(spec) (taking into account other devices and the corresponding actions > > of the vendor driver)? > > > > > Given our timing relative to QEMU 5.2, the only path I feel comfortable > > > with is to move forward with downgrading vfio migration support to be > > > enabled via an experimental option. Objections? Thanks, > > > > Alright, but this issue is related to our ARM GICv4.1 migration scheme, could you > > give a rough idea about this (where to enable interrupts, we hope it to be after > > the restoring of VGIC)? > > I disagree. If this is only specific to Huawei ARM GIC implementation, why do we want to > make the entire VFIO based migration an experimental feature? It sounds though like the problem is really a general ordering problem; in normal migration we get the non-iterable device states at the end of the migration, at the point when the CPUs have already been stopped; then there's ordering of some of those device states to ensure that interrupt controllers happen first. That's not an ARM GIC specific issue - any interrupt chain can cause it. As others have said, the other problem here is that other parts of the state in there can change during hte lifetime of the migration and you need to make sure you end up with the right config space state on the destination. Dave > Thanks, > Neo > > > > > Thanks, > > Shenming > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, 20 Nov 2020 15:01:46 -0700 Alex Williamson <alex.williamson@redhat.com> wrote: > Given our timing relative to QEMU 5.2, the only path I feel comfortable > with is to move forward with downgrading vfio migration support to be > enabled via an experimental option. Objections? Thanks, > > Alex Agreed from my side. It seems better to give it one more release to figure out some issues, and just mark it experimental for now.
© 2016 - 2024 Red Hat, Inc.