[PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)

Steve Sistare posted 29 patches 4 years, 1 month ago
There is a newer version of this series
Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in cpr state.

In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be remapped
at a different VA after exec.  DMA to already-mapped pages continues.  Save
the MSI message area as part of vfio-pci vmstate, save the interrupt and
notifier eventfds in cpr state, and clear the close-on-exec flag for the
vfio descriptors.  The flag is not cleared earlier because the descriptors
should not persist across miscellaneous fork and exec calls that may be
performed during normal operation.

On qemu restart, vfio_realize() finds and uses the saved descriptors, and
notes that the device is being reused.  Device and iommu state is already
configured, so operations in vfio_realize that would modify the
configuration are skipped for a reused device, including vfio ioctls and
writes to PCI configuration space.  The result is that vfio_realize
constructs qemu data structures that reflect the current state of the
device.  However, the reconstruction is not complete until cpr-load is
called.  cpr-load loads the MSI data and finds eventfds in cpr state.  It
rebuilds vector data structures and attaches the interrupts to the new KVM
instance.  cpr-load then invokes the main vfio listener callback, which
walks the flattened ranges of the vfio_address_spaces and calls
VFIO_IOMMU_MAP_DMA with VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the
new VAs.  Lastly, it starts the VM and suppresses vfio-pci device reset.

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTx support.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS                   |   1 +
 hw/pci/pci.c                  |  10 ++++
 hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/pci/pci.h          |   1 +
 include/hw/vfio/vfio-common.h |   8 +++
 include/migration/cpr.h       |   3 ++
 migration/cpr.c               |  10 +++-
 migration/target.c            |  14 +++++
 12 files changed, 324 insertions(+), 11 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cfe7480..feed239 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2992,6 +2992,7 @@ CPR
 M: Steve Sistare <steven.sistare@oracle.com>
 M: Mark Kanda <mark.kanda@oracle.com>
 S: Maintained
+F: hw/vfio/cpr.c
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 0fd21e1..e35df4f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
     int r;
 
+    /*
+     * A reused vfio-pci device is already configured, so do not reset it
+     * during qemu_system_reset prior to cpr-load, else interrupts may be
+     * lost.  By contrast, pure-virtual pci devices may be reset here and
+     * updated with new state in cpr-load with no ill effects.
+     */
+    if (dev->reused) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5b87f95..90f66ad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    assert(!container->reused);
+
     if (iotlb && container->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
@@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
 {
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
         .vaddr = (__u64)(uintptr_t)vaddr,
         .iova = iova,
         .size = size,
     };
 
+    /*
+     * Set the new vaddr for any mappings registered during cpr-load.
+     * Reused is cleared thereafter.
+     */
+    if (container->reused) {
+        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+            goto fail;
+        }
+        return 0;
+    }
+
+    map.flags = VFIO_DMA_MAP_FLAG_READ;
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+fail:
+    error_report("vfio_dma_map %s (iova 0x%"HWADDR_PRIx", size 0x%"HWADDR_PRIx", va %p): %s",
+        container->reused ? "VADDR" : "", iova, size, vaddr, strerror(errno));
     return -errno;
 }
 
@@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    vfio_container_region_add(container, section);
+}
+
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         int iommu_idx;
 
         trace_vfio_listener_region_add_iommu(iova, end);
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
+void vfio_listener_register(VFIOContainer *container)
+{
+    container->listener = vfio_memory_listener;
+    memory_listener_register(&container->listener, container->space->as);
+}
+
 static struct vfio_info_cap_header *
 vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
 {
@@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 {
     int iommu_type, ret;
 
+    /*
+     * If container is reused, just set its type and skip the ioctls, as the
+     * container and group are already configured in the kernel.
+     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
+     * If you ever add new types or spapr cpr support, kind reader, please
+     * also implement VFIO_GET_IOMMU.
+     */
+    if (container->reused) {
+        container->iommu_type = VFIO_TYPE1v2_IOMMU;
+        return 0;
+    }
+
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
         return iommu_type;
@@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 {
     VFIOContainer *container;
     int ret, fd;
+    bool reused;
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+    reused = (fd >= 0);
 
     /*
      * VFIO is currently incompatible with discarding of RAM insofar as the
@@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
+    /*
+     * If the container is reused, then the group is already attached in the
+     * kernel.  If a container with matching fd is found, then update the
+     * userland group list and return.  If not, then after the loop, create
+     * the container struct and group list.
+     */
+
     QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+        if ((reused && container->fd == fd) ||
+            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
                 error_setg_errno(errp, -ret,
@@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             }
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
+            if (!reused) {
+                vfio_kvm_device_add_group(group);
+                cpr_save_fd("vfio_container_for_group", group->groupid,
+                            container->fd);
+            }
             return 0;
         }
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (!reused) {
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    }
+
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->reused = reused;
     container->error = NULL;
     container->dirty_pages_supported = false;
     container->dma_max_mappings = 0;
@@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
+    /*
+     * If reused, register the listener later, after all state that may
+     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
+     * the listener will invoke its callback on each flat section and call
+     * vfio_dma_map to supply the new vaddr, and the calls will match the
+     * mappings remembered by the kernel.
+     */
+    if (!reused) {
+        vfio_listener_register(container);
+    }
 
     if (container->error) {
         ret = -1;
@@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container->initialized = true;
+    if (!reused) {
+        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
+    }
 
     return 0;
 listener_release_exit:
@@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     VFIOGroup *group;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
+    bool reused;
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
@@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
+
+    group->fd = cpr_find_fd("vfio_group", groupid);
+    reused = (group->fd >= 0);
+    if (!reused) {
+        group->fd = qemu_open_old(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    if (!reused) {
+        cpr_save_fd("vfio_group", groupid, group->fd);
+    }
+
     return group;
 
 close_fd_exit:
@@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 
@@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = cpr_find_fd(name, 0);
+    reused = (fd >= 0);
+    if (!reused) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     vbasedev->num_irqs = dev_info.num_irqs;
     vbasedev->num_regions = dev_info.num_regions;
     vbasedev->flags = dev_info.flags;
+    vbasedev->reused = reused;
+    if (!reused) {
+        cpr_save_fd(name, 0, fd);
+    }
 
     trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
                           dev_info.num_irqs);
@@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
new file mode 100644
index 0000000..2c39cd5
--- /dev/null
+++ b/hw/vfio/cpr.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+static int
+vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return -errno;
+    }
+    return 0;
+}
+
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
+        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
+                         "or VFIO_UNMAP_ALL");
+        return false;
+    } else {
+        return true;
+    }
+}
+
+/*
+ * Verify that all containers support CPR, and unmap all dma vaddr's.
+ */
+int vfio_cpr_save(Error **errp)
+{
+    ERRP_GUARD();
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+            if (vfio_dma_unmap_vaddr_all(container, errp)) {
+                return -1;
+            }
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Register the listener for each container, which causes its callback to be
+ * invoked for every flat section.  The callback will see that the container
+ * is reused, and call vfio_dma_map with the new vaddr.
+ */
+int vfio_cpr_load(Error **errp)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+            vfio_listener_register(container);
+            container->reused = false;
+        }
+    }
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->reused = false;
+        }
+    }
+    return 0;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..e247b2b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+  'cpr.c',
   'display.c',
   'pci-quirks.c',
   'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a90cce2..acac8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -30,6 +30,7 @@
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
 #include "qapi/qmp/qdict.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         vfio_put_group(group);
         goto error;
     }
+    pdev->reused = vdev->vbasedev.reused;
 
     vfio_populate_device(vdev, &err);
     if (err) {
@@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
+    /* Do not reset the device during qemu_system_reset prior to cpr-load */
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vfio_merge_config(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    g_autofree uint8_t *phys_config = g_malloc(size);
+    uint32_t mask;
+    int ret, i;
+
+    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
+    if (ret < size) {
+        ret = ret < 0 ? errno : EFAULT;
+        error_report("failed to read device config space: %s", strerror(ret));
+        return;
+    }
+
+    for (i = 0; i < size; i++) {
+        mask = vdev->emulated_config_bits[i];
+        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
+    }
+}
+
+/*
+ * The kernel may change non-emulated config bits.  Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_pci_pre_load(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    int i;
+
+    for (i = 0; i < size; i++) {
+        pdev->cmask[i] &= vdev->emulated_config_bits[i];
+    }
+
+    return 0;
+}
+
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vfio_merge_config(vdev);
+
+    pdev->reused = false;
+
+    return 0;
+}
+
+static bool vfio_pci_needed(void *opaque)
+{
+    return cpr_get_mode() == CPR_MODE_RESTART;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_load = vfio_pci_pre_load,
+    .post_load = vfio_pci_post_load,
+    .needed = vfio_pci_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f..63dd0fe 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cc63dd4..8557e82 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -361,6 +361,7 @@ struct PCIDevice {
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
+    bool reused;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1641753..bc23c29 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,7 @@ typedef struct VFIOContainer {
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool reused;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
@@ -136,6 +137,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     bool enable_migration;
+    bool reused;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_cpr_save(Error **errp);
+int vfio_cpr_load(Error **errp);
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
@@ -236,6 +241,9 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
+void vfio_listener_register(VFIOContainer *container);
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index a4da24e..a4007cf 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
 int cpr_state_load(Error **errp);
 void cpr_state_print(void);
 
+int cpr_vfio_save(Error **errp);
+int cpr_vfio_load(Error **errp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 37eca66..cee82cf 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,6 +7,7 @@
 
 #include "qemu/osdep.h"
 #include "exec/memory.h"
+#include "hw/vfio/vfio-common.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
 #include "migration.h"
@@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
         error_setg(errp, "cpr-exec requires cpr-save with restart mode");
         return;
     }
-
+    if (cpr_vfio_save(errp)) {
+        return;
+    }
     cpr_walk_fd(preserve_fd, 0);
     if (cpr_state_save(errp)) {
         return;
@@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
         goto out;
     }
 
+    if (cpr_get_mode() == CPR_MODE_RESTART &&
+        cpr_vfio_load(errp)) {
+        goto out;
+    }
+
     state = global_state_get_runstate();
     if (state == RUN_STATE_RUNNING) {
         vm_start();
diff --git a/migration/target.c b/migration/target.c
index 4390bf0..984bc9e 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include "migration/cpr.h"
 #include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
@@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
         info->vfio->transferred = vfio_mig_bytes_transferred();
     }
 }
+
+int cpr_vfio_save(Error **errp)
+{
+    return vfio_cpr_save(errp);
+}
+
+int cpr_vfio_load(Error **errp)
+{
+    return vfio_cpr_load(errp);
+}
+
 #else
 
 void populate_vfio_info(MigrationInfo *info) {}
+int cpr_vfio_save(Error **errp) { return 0; }
+int cpr_vfio_load(Error **errp) { return 0; }
 
 #endif /* CONFIG_VFIO */
-- 
1.8.3.1


Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Michael S. Tsirkin 4 years, 1 month ago
On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> [...]
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    /*
> +     * A reused vfio-pci device is already configured, so do not reset it
> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> +     * updated with new state in cpr-load with no ill effects.
> +     */
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  


Hmm that's a weird thing to do. I suspect this works because
"reused" means something like "in the process of being restored"?
Because clearly, we do not want to skip this part e.g. when
guest resets the device.
So a better name could be called for, but really I don't
love how vfio gets to poke at internal PCI state.
I'd rather we found a way just not to call this function.
If we can't, maybe an explicit API, and make it
actually say what it's doing?


> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          int iommu_idx;
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
> +
>          /*
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     * If you ever add new types or spapr cpr support, kind reader, please
> +     * also implement VFIO_GET_IOMMU.
> +     */
> +    if (container->reused) {
> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +        return 0;
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd >= 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with a matching fd is found, then update the
> +     * userland group list and return.  If not, then after the loop, create
> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +        if ((reused && container->fd == fd) ||
> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
>                  error_setg_errno(errp, -ret,
> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> +            if (!reused) {
> +                vfio_kvm_device_add_group(group);
> +                cpr_save_fd("vfio_container_for_group", group->groupid,
> +                            container->fd);
> +            }
>              return 0;
>          }
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    }
> +
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    if (!reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
> +    }
>  
>      return 0;
>  listener_release_exit:
> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      VFIOGroup *group;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
> +    bool reused;
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    reused = (group->fd >= 0);
> +    if (!reused) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (!reused) {
> +        cpr_save_fd("vfio_group", groupid, group->fd);
> +    }
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    if (!reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..2c39cd5
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +/*
> + * Verify that all containers support CPR, and unmap the vaddrs of all DMA mappings.
> + */
> +int vfio_cpr_save(Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/*
> + * Register the listener for each container, which causes its callback to be
> + * invoked for every flat section.  The callback will see that the container
> + * is reused, and call vfio_dma_map with the new vaddr.
> + */
> +int vfio_cpr_load(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            vfio_listener_register(container);
> +            container->reused = false;
> +        }
> +    }
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a90cce2..acac8a7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->pdev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_merge_config(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    g_autofree uint8_t *phys_config = g_malloc(size);
> +    uint32_t mask;
> +    int ret, i;
> +
> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
> +    if (ret < size) {
> +        ret = ret < 0 ? errno : EFAULT;
> +        error_report("failed to read device config space: %s", strerror(ret));
> +        return;
> +    }
> +
> +    for (i = 0; i < size; i++) {
> +        mask = vdev->emulated_config_bits[i];
> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
> +    }
> +}
> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vfio_merge_config(vdev);
> +
> +    pdev->reused = false;
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_pci_pre_load,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f..63dd0fe 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cc63dd4..8557e82 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -361,6 +361,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1641753..bc23c29 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cpr_save(Error **errp);
> +int vfio_cpr_load(Error **errp);
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index a4da24e..a4007cf 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>  int cpr_state_load(Error **errp);
>  void cpr_state_print(void);
>  
> +int cpr_vfio_save(Error **errp);
> +int cpr_vfio_load(Error **errp);
> +
>  #endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 37eca66..cee82cf 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,6 +7,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "exec/memory.h"
> +#include "hw/vfio/vfio-common.h"
>  #include "io/channel-buffer.h"
>  #include "io/channel-file.h"
>  #include "migration.h"
> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>          return;
>      }
> -
> +    if (cpr_vfio_save(errp)) {
> +        return;
> +    }
>      cpr_walk_fd(preserve_fd, 0);
>      if (cpr_state_save(errp)) {
>          return;
> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>          goto out;
>      }
>  
> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
> +        cpr_vfio_load(errp)) {
> +        goto out;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/migration/target.c b/migration/target.c
> index 4390bf0..984bc9e 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include "migration/cpr.h"
>  #include CONFIG_DEVICES
>  
>  #ifdef CONFIG_VFIO
> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
>  }
> +
> +int cpr_vfio_save(Error **errp)
> +{
> +    return vfio_cpr_save(errp);
> +}
> +
> +int cpr_vfio_load(Error **errp)
> +{
> +    return vfio_cpr_load(errp);
> +}
> +
>  #else
>  
>  void populate_vfio_info(MigrationInfo *info) {}
> +int cpr_vfio_save(Error **errp) { return 0; }
> +int cpr_vfio_load(Error **errp) { return 0; }
>  
>  #endif /* CONFIG_VFIO */
> -- 
> 1.8.3.1


Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 4 years, 1 month ago
On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    /*
>> +     * A reused vfio-pci device is already configured, so do not reset it
>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>> +     * updated with new state in cpr-load with no ill effects.
>> +     */
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
> 
> 
> Hmm that's a weird thing to do. I suspect this works because
> "reused" means something like "in the process of being restored"?
> Because clearly, we do not want to skip this part e.g. when
> guest resets the device.

Exactly.  vfio_realize sets the flag if it detects the device is reused during
a restart, and vfio_pci_post_load clears the reused flag.

> So a better name could be called for, but really I don't
> love how vfio gets to poke at internal PCI state.
> I'd rather we found a way just not to call this function.
> If we can't, maybe an explicit API, and make it
> actually say what it's doing?

How about:

pci_set_restore(PCIDevice *dev) { dev->restore = true; }
pci_clr_restore(PCIDevice *dev) { dev->restore = false; }

vfio_realize()
  pci_set_restore(pdev)

vfio_pci_post_load()
  pci_clr_restore(pdev)

pci_do_device_reset()
    if (dev->restore)
        return;

- Steve
 

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Michael S. Tsirkin 4 years, 1 month ago
On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> > 
> > 
> > Hmm that's a weird thing to do. I suspect this works because
> > "reused" means something like "in the process of being restored"?
> > Because clearly, we do not want to skip this part e.g. when
> > guest resets the device.
> 
> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> a restart, and vfio_pci_post_load clears the reused flag.
> 
> > So a better name could be called for, but really I don't
> > love how vfio gets to poke at internal PCI state.
> > I'd rather we found a way just not to call this function.
> > If we can't, maybe an explicit API, and make it
> > actually say what it's doing?
> 
> How about:
> 
> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> 
> vfio_realize()
>   pci_set_restore(pdev)
> 
> vfio_pci_post_load()
>   pci_clr_restore(pdev)
> 
> pci_do_device_reset()
>     if (dev->restore)
>         return;
> 
> - Steve


Not too bad. I'd like a better definition of what dev->restore means,
exactly, added in comments near where it is defined and used.

E.g. does this mean "device is being restored because of qemu restart"?

Do we need a per device flag for this thing or would a global
"qemu restart in progress" flag be enough?

-- 
MST


Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 4 years, 1 month ago
On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>>>> +     * updated with new state in cpr-load with no ill effects.
>>>> +     */
>>>> +    if (dev->reused) {
>>>> +        return;
>>>> +    }
>>>> +
>>>>      pci_device_deassert_intx(dev);
>>>>      assert(dev->irq_state == 0);
>>>>  
>>>
>>>
>>> Hmm that's a weird thing to do. I suspect this works because
>>> "reused" means something like "in the process of being restored"?
>>> Because clearly, we do not want to skip this part e.g. when
>>> guest resets the device.
>>
>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>> a restart, and vfio_pci_post_load clears the reused flag.
>>
>>> So a better name could be called for, but really I don't
>>> love how vfio gets to poke at internal PCI state.
>>> I'd rather we found a way just not to call this function.
>>> If we can't, maybe an explicit API, and make it
>>> actually say what it's doing?
>>
>> How about:
>>
>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>
>> vfio_realize()
>>   pci_set_restore(pdev)
>>
>> vfio_pci_post_load()
>>   pci_clr_restore(pdev)
>>
>> pci_do_device_reset()
>>     if (dev->restore)
>>         return;
>>
>> - Steve
> 
> 
> Not too bad. I'd like a better definition of what dev->restore is
> exactly and to add them in comments near where it
> is defined and used.

Will do.

> E.g. does this mean "device is being restored because of qemu restart"?
> 
> Do we need a per device flag for this thing or would a global
> "qemu restart in progress" flag be enough?

A global flag (or function, which already exists) would suppress reset for all
PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
load may implicitly depend on the device having been reset for correctness, by 
virtue of some fields being initialized in the reset function.

- Steve
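[Editor's note: the per-device flag lifecycle proposed above can be sketched in a
self-contained way. The struct and the reset body below are simplified stand-ins,
not the real QEMU PCIDevice or pci_do_device_reset; only the set/clear/skip shape
mirrors the proposal.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for PCIDevice: only the fields relevant here. */
typedef struct {
    bool restore;    /* true while the device is being restored across a qemu restart */
    int irq_state;   /* stands in for state that reset would clear */
} PCIDevice;

/* Explicit API as proposed above, so vfio does not poke at PCI internals. */
static void pci_set_restore(PCIDevice *dev) { dev->restore = true; }
static void pci_clr_restore(PCIDevice *dev) { dev->restore = false; }

/* Reset is suppressed only while restore is set; once vfio_pci_post_load
 * clears the flag, a guest-initiated reset takes effect normally again. */
static void pci_do_device_reset(PCIDevice *dev)
{
    if (dev->restore) {
        return;
    }
    dev->irq_state = 0;   /* placeholder for the real reset work */
}
```

The point of the per-device flag, as opposed to a global one, is visible here:
only the device whose realize path set the flag skips the reset.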

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Michael S. Tsirkin 4 years, 1 month ago
On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> >> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >>>> [...]
> >>>
> >>>
> >>> Hmm that's a weird thing to do. I suspect this works because
> >>> "reused" means something like "in the process of being restored"?
> >>> Because clearly, we do not want to skip this part e.g. when
> >>> guest resets the device.
> >>
> >> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> >> a restart, and vfio_pci_post_load clears the reused flag.
> >>
> >>> So a better name could be called for, but really I don't
> >>> love how vfio gets to poke at internal PCI state.
> >>> I'd rather we found a way just not to call this function.
> >>> If we can't, maybe an explicit API, and make it
> >>> actually say what it's doing?
> >>
> >> How about:
> >>
> >> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> >> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> >>
> >> vfio_realize()
> >>   pci_set_restore(pdev)
> >>
> >> vfio_pci_post_load()
> >>   pci_clr_restore(pdev)
> >>
> >> pci_do_device_reset()
> >>     if (dev->restore)
> >>         return;
> >>
> >> - Steve
> > 
> > 
> > Not too bad. I'd like a better definition of what dev->restore is
> > exactly and to add them in comments near where it
> > is defined and used.
> 
> Will do.
> 
> > E.g. does this mean "device is being restored because of qemu restart"?
> > 
> > Do we need a per device flag for this thing or would a global
> > "qemu restart in progress" flag be enough?
> 
> A global flag (or function, which already exists) would suppress reset for all
> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
> load may implicitly depend on the device having been reset for correctness, by 
> virtue of some fields being initialized in the reset function.
> 
> - Steve

So just so I understand, how do these other devices work with restart?
Do they use the save/loadvm machinery? And the reason vfio doesn't
is because it generally does not support savevm/loadvm?

-- 
MST


Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 4 years, 1 month ago
On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>>>> [...]
>>>>>
>>>>>
>>>>> Hmm that's a weird thing to do. I suspect this works because
>>>>> "reused" means something like "in the process of being restored"?
>>>>> Because clearly, we do not want to skip this part e.g. when
>>>>> guest resets the device.
>>>>
>>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>>>> a restart, and vfio_pci_post_load clears the reused flag.
>>>>
>>>>> So a better name could be called for, but really I don't
>>>>> love how vfio gets to poke at internal PCI state.
>>>>> I'd rather we found a way just not to call this function.
>>>>> If we can't, maybe an explicit API, and make it
>>>>> actually say what it's doing?
>>>>
>>>> How about:
>>>>
>>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>>>
>>>> vfio_realize()
>>>>   pci_set_restore(pdev)
>>>>
>>>> vfio_pci_post_load()
>>>>   pci_clr_restore(pdev)
>>>>
>>>> pci_do_device_reset()
>>>>     if (dev->restore)
>>>>         return;
>>>>
>>>> - Steve
>>>
>>>
>>> Not too bad. I'd like a better definition of what dev->restore is
>>> exactly and to add them in comments near where it
>>> is defined and used.
>>
>> Will do.
>>
>>> E.g. does this mean "device is being restored because of qemu restart"?
>>>
>>> Do we need a per device flag for this thing or would a global
>>> "qemu restart in progress" flag be enough?
>>
>> A global flag (or function, which already exists) would suppress reset for all
>> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
>> load may implicitly depend on the device having been reset for correctness, by 
>> virtue of some fields being initialized in the reset function.
>>
>> - Steve
> 
> So just so I understand, how do these other devices work with restart?
> Do they use the save/loadvm machinery? And the reason vfio doesn't
> is because it generally does not support savevm/loadvm?

They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
plus it preserves its device descriptors.  The only bit we are skipping is the reset
function for vfio-pci, because the hardware device is actively processing dma and 
interrupts, and they would be lost.  Reset is called unconditionally for all devices 
during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
qemu_system_reset().

- Steve
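[Editor's note: the startup ordering described above, where reset runs
unconditionally for every device before any vmstate is loaded, can be modeled
with a self-contained sketch. The function names echo the QEMU call path, but
the bodies are simplified stand-ins, not the real code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool reused;     /* set when realize finds saved descriptors */
    int irq_state;   /* stands in for device state that reset would clear */
} PCIDevice;

static void pci_do_device_reset(PCIDevice *dev)
{
    if (dev->reused) {
        return;      /* hardware is live: do not drop in-flight interrupts */
    }
    dev->irq_state = 0;
}

/* Models qdev_machine_creation_done() -> qemu_system_reset(): every device
 * is reset unconditionally, prior to loadvm. */
static void qemu_system_reset(PCIDevice **devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pci_do_device_reset(devs[i]);
    }
}
```

This is why the check lives inside the per-device reset function: the caller
cannot know which devices are backed by live hardware.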

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Michael S. Tsirkin 4 years, 1 month ago
On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> >> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> >>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> >>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >>>>>> [...]
> >>>>>
> >>>>>
> >>>>> Hmm that's a weird thing to do. I suspect this works because
> >>>>> "reused" means something like "in the process of being restored"?
> >>>>> Because clearly, we do not want to skip this part e.g. when
> >>>>> guest resets the device.
> >>>>
> >>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> >>>> a restart, and vfio_pci_post_load clears the reused flag.
> >>>>
> >>>>> So a better name could be called for, but really I don't
> >>>>> love how vfio gets to poke at internal PCI state.
> >>>>> I'd rather we found a way just not to call this function.
> >>>>> If we can't, maybe an explicit API, and make it
> >>>>> actually say what it's doing?
> >>>>
> >>>> How about:
> >>>>
> >>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> >>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> >>>>
> >>>> vfio_realize()
> >>>>   pci_set_restore(pdev)
> >>>>
> >>>> vfio_pci_post_load()
> >>>>   pci_clr_restore(pdev)
> >>>>
> >>>> pci_do_device_reset()
> >>>>     if (dev->restore)
> >>>>         return;
> >>>>
> >>>> - Steve
> >>>
> >>>
> >>> Not too bad. I'd like a better definition of what dev->restore is
> >>> exactly and to add them in comments near where it
> >>> is defined and used.
> >>
> >> Will do.
> >>
> >>> E.g. does this mean "device is being restored because of qemu restart"?
> >>>
> >>> Do we need a per device flag for this thing or would a global
> >>> "qemu restart in progress" flag be enough?
> >>
> >> A global flag (or function, which already exists) would suppress reset for all
> >> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
> >> load may implicitly depend on the device having been reset for correctness, by 
> >> virtue of some fields being initialized in the reset function.
> >>
> >> - Steve

I took a look and I don't really see any cases like this.
I think pci_qdev_realize will initialize the pci core to a correct state,
pci_do_device_reset isn't necessary right after realize.
It seems safe to just skip it for all devices unconditionally.
A bunch of devices do depend on reset to init them correctly,
e.g. hw/ide/piix.c sets pci status in piix_ide_reset.
But pci core does not seem to.


> > So just so I understand, how do these other devices work with restart?
> > Do they use the save/loadvm machinery? And the reason vfio doesn't
> > is because it generally does not support savevm/loadvm?
> 
> They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
> plus it preserves its device descriptors.  The only bit we are skipping is the reset
> function for vfio-pci, because the hardware device is actively processing dma and 
> interrupts, and they would be lost.  Reset is called unconditionally for all devices 
> during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
> qemu_system_reset().
> 
> - Steve


Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 4 years, 1 month ago
On 1/6/2022 4:12 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
>> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>>>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>>>>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>>>>>> [...]
>>>>>>>
>>>>>>>
>>>>>>> Hmm that's a weird thing to do. I suspect this works because
>>>>>>> "reused" means something like "in the process of being restored"?
>>>>>>> Because clearly, we do not want to skip this part e.g. when
>>>>>>> guest resets the device.
>>>>>>
>>>>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>>>>>> a restart, and vfio_pci_post_load clears the reused flag.
>>>>>>
>>>>>>> So a better name could be called for, but really I don't
>>>>>>> love how vfio gets to poke at internal PCI state.
>>>>>>> I'd rather we found a way just not to call this function.
>>>>>>> If we can't, maybe an explicit API, and make it
>>>>>>> actually say what it's doing?
>>>>>>
>>>>>> How about:
>>>>>>
>>>>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>>>>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>>>>>
>>>>>> vfio_realize()
>>>>>>   pci_set_restore(pdev)
>>>>>>
>>>>>> vfio_pci_post_load()
>>>>>>   pci_clr_restore(pdev)
>>>>>>
>>>>>> pci_do_device_reset()
>>>>>>     if (dev->restore)
>>>>>>         return;
>>>>>>
>>>>>> - Steve
>>>>>
>>>>>
>>>>> Not too bad. I'd like a better definition of what dev->restore is
>>>>> exactly and to add them in comments near where it
>>>>> is defined and used.
>>>>
>>>> Will do.
>>>>
>>>>> E.g. does this mean "device is being restored because of qemu restart"?
>>>>>
>>>>> Do we need a per device flag for this thing or would a global
>>>>> "qemu restart in progress" flag be enough?
>>>>
>>>> A global flag (or function, which already exists) would suppress reset for all
>>>> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
>>>> load may implicitly depend on the device having been reset for correctness, by 
>>>> virtue of some fields being initialized in the reset function.
>>>>
>>>> - Steve
> 
> I took a look and I don't really see any cases like this.
> I think pci_qdev_realize will initialize the pci core to a correct state,
> pci_do_device_reset isn't necessary right after realize.
> It seems safe to just skip it for all devices unconditionally.
> A bunch of devices do depend on reset to init them correctly,
> e.g. hw/ide/piix.c sets pci status in piix_ide_reset.
> But pci core does not seem to.

Cool.  After comparing pci_do_device_reset to pci_qdev_realize -> do_pci_register_device,
I concur.

This will do the trick.  Mode is set before reset is called, and cleared in cpr-load.

--------------------------
#include "migration/cpr.h"

static void pci_do_device_reset(PCIDevice *dev)
{
    int r;

    /*
     * A PCI device that is resuming for cpr is already configured, so do
     * not reset it here when we are called from qemu_system_reset prior to
     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
     * safe to skip this reset for all PCI devices, because cpr-load will set
     * all fields that would have been set here.
     */
    if (cpr_get_mode() == CPR_MODE_RESTART) {
        return;
    }
-----------------------------

- Steve

>>> So just so I understand, how do these other devices work with restart?
>>> Do they use the save/loadvm machinery? And the reason vfio doesn't
>>> is because it generally does not support savevm/loadvm?
>>
>> They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
>> plus it preserves its device descriptors.  The only bit we are skipping is the reset
>> function for vfio-pci, because the hardware device is actively processing dma and 
>> interrupts, and they would be lost.  Reset is called unconditionally for all devices 
>> during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
>> qemu_system_reset().
>>
>> - Steve
> 

Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Alex Williamson 3 years, 11 months ago
On Wed, 22 Dec 2021 11:05:24 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |  10 ++++
>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/pci/pci.h          |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h       |   3 ++
>  migration/cpr.c               |  10 +++-
>  migration/target.c            |  14 +++++
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    /*
> +     * A reused vfio-pci device is already configured, so do not reset it
> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> +     * updated with new state in cpr-load with no ill effects.
> +     */
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          int iommu_idx;
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
> +
>          /*
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     * If you ever add new types or spapr cpr support, kind reader, please
> +     * also implement VFIO_GET_IOMMU.
> +     */

VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
problem is that vfio_iommu_type1_check_extension() should actually base
some of the details on the instantiated vfio_iommu, ex.

	switch (arg) {
	case VFIO_TYPE1_IOMMU:
		return (iommu && iommu->v2) ? 0 : 1;
	case VFIO_UNMAP_ALL:
	case VFIO_UPDATE_VADDR:
	case VFIO_TYPE1v2_IOMMU:
		return (iommu && !iommu->v2) ? 0 : 1;
	case VFIO_TYPE1_NESTING_IOMMU:
		return (iommu && !iommu->nesting) ? 0 : 1;
	...

We can't support v1 if we've already set a v2 container and vice versa.
There are probably some corner cases and compatibility to puzzle
through, but I wouldn't think we need a new ioctl to check this.


> +    if (container->reused) {
> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +        return 0;
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd > 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  It not, then after the loop, create

s/It/If/

> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +        if ((reused && container->fd == fd) ||
> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {


We can have multiple containers, so this can still call the ioctl when
reused = true.  I think it still works, but it's a bit ugly; we're
relying on the ioctl failing when the container is already set for the
group.  Does this need to be something like:

        if (reused) {
            if (container->fd != fd) {
                continue;
            }
        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            continue;
        }

>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
>                  error_setg_errno(errp, -ret,
> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> +            if (!reused) {
> +                vfio_kvm_device_add_group(group);
> +                cpr_save_fd("vfio_container_for_group", group->groupid,
> +                            container->fd);
> +            }
>              return 0;
>          }
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    }
> +
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    if (!reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
> +    }
>  
>      return 0;
>  listener_release_exit:
> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);

Did you consider having cpr_save_fd() do a find_name() and update/no-op
if found so that we can casually call cpr_save_fd() without nesting it
in a branch the same as done for cpr_delete_fd()?

>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      VFIOGroup *group;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
> +    bool reused;
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    reused = (group->fd >= 0);
> +    if (!reused) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (!reused) {
> +        cpr_save_fd("vfio_group", groupid, group->fd);
> +    }

If cpr_save_fd() were idempotent as above, we wouldn't need the reused
variable here and the previous chunk could be simplified.  It might
even suggest a function like "cpr_find_or_open_fd()".

> +
>      return group;
>  
>  close_fd_exit:
> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    if (!reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }

Another cleanup here if we didn't need to tiptoe around cpr_save_fd().

>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..2c39cd5
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}

We could have minimally used this where we assumed a TYPE1v2 container.

> +
> +/*
> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> + */
> +int vfio_cpr_save(Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }

Seems like we ought to validate all containers support CPR before we
start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
replay the listeners to remap the vaddrs in case of an error?

> +
> +    return 0;
> +}
> +
> +/*
> + * Register the listener for each container, which causes its callback to be
> + * invoked for every flat section.  The callback will see that the container
> + * is reused, and call vfio_dma_map with the new vaddr.
> + */
> +int vfio_cpr_load(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            vfio_listener_register(container);
> +            container->reused = false;
> +        }
> +    }
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a90cce2..acac8a7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->pdev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_merge_config(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    g_autofree uint8_t *phys_config = g_malloc(size);
> +    uint32_t mask;
> +    int ret, i;
> +
> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
> +    if (ret < size) {
> +        ret = ret < 0 ? errno : EFAULT;
> +        error_report("failed to read device config space: %s", strerror(ret));
> +        return;
> +    }
> +
> +    for (i = 0; i < size; i++) {
> +        mask = vdev->emulated_config_bits[i];
> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
> +    }
> +}

IIUC, we get a copy of config space from the vfio device and for each
byte, we keep what we have in emulated config space for emulated bits
and fill in from the device for non-emulated bits.  Meanwhile,
vfio_pci_read_config() doesn't ever return non-emulated bits from
emulated config space, so what specifically are we accomplishing here?

> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}

The previous function seemed like maybe an attempt to make non-emulated
bits in emulated config space consistent for testing, but here we're
masking all non-emulated bits out of that mask.  Why do we need to do
both?

> +
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vfio_merge_config(vdev);
> +
> +    pdev->reused = false;
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_pci_pre_load,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f..63dd0fe 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cc63dd4..8557e82 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -361,6 +361,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1641753..bc23c29 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cpr_save(Error **errp);
> +int vfio_cpr_load(Error **errp);
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index a4da24e..a4007cf 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>  int cpr_state_load(Error **errp);
>  void cpr_state_print(void);
>  
> +int cpr_vfio_save(Error **errp);
> +int cpr_vfio_load(Error **errp);
> +
>  #endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 37eca66..cee82cf 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,6 +7,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "exec/memory.h"
> +#include "hw/vfio/vfio-common.h"
>  #include "io/channel-buffer.h"
>  #include "io/channel-file.h"
>  #include "migration.h"
> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>          return;
>      }
> -
> +    if (cpr_vfio_save(errp)) {
> +        return;
> +    }

Why is vfio so unique that it needs separate handlers versus other
devices?  Thanks,

Alex


>      cpr_walk_fd(preserve_fd, 0);
>      if (cpr_state_save(errp)) {
>          return;
> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>          goto out;
>      }
>  
> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
> +        cpr_vfio_load(errp)) {
> +        goto out;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/migration/target.c b/migration/target.c
> index 4390bf0..984bc9e 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include "migration/cpr.h"
>  #include CONFIG_DEVICES
>  
>  #ifdef CONFIG_VFIO
> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
>  }
> +
> +int cpr_vfio_save(Error **errp)
> +{
> +    return vfio_cpr_save(errp);
> +}
> +
> +int cpr_vfio_load(Error **errp)
> +{
> +    return vfio_cpr_load(errp);
> +}
> +
>  #else
>  
>  void populate_vfio_info(MigrationInfo *info) {}
> +int cpr_vfio_save(Error **errp) { return 0; }
> +int cpr_vfio_load(Error **errp) { return 0; }
>  
>  #endif /* CONFIG_VFIO */
Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 3 years, 11 months ago
On 3/7/2022 5:16 PM, Alex Williamson wrote:
> On Wed, 22 Dec 2021 11:05:24 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> [...]
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    /*
>> +     * A reused vfio-pci device is already configured, so do not reset it
>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>> +     * updated with new state in cpr-load with no ill effects.
>> +     */
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5b87f95..90f66ad 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          .size = size,
>>      };
>>  
>> +    assert(!container->reused);
>> +
>>      if (iotlb && container->dirty_pages_supported &&
>>          vfio_devices_all_running_and_saving(container)) {
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>  {
>>      struct vfio_iommu_type1_dma_map map = {
>>          .argsz = sizeof(map),
>> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>>          .vaddr = (__u64)(uintptr_t)vaddr,
>>          .iova = iova,
>>          .size = size,
>>      };
>>  
>> +    /*
>> +     * Set the new vaddr for any mappings registered during cpr-load.
>> +     * Reused is cleared thereafter.
>> +     */
>> +    if (container->reused) {
>> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
>> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +            goto fail;
>> +        }
>> +        return 0;
>> +    }
>> +
>> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          return 0;
>>      }
>>  
>> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
>> +fail:
>> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
>> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>>      return -errno;
>>  }
>>  
>> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>                                       MemoryRegionSection *section)
>>  {
>>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    vfio_container_region_add(container, section);
>> +}
>> +
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section)
>> +{
>>      hwaddr iova, end;
>>      Int128 llend, llsize;
>>      void *vaddr;
>> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>          int iommu_idx;
>>  
>>          trace_vfio_listener_region_add_iommu(iova, end);
>> +
>>          /*
>>           * FIXME: For VFIO iommu types which have KVM acceleration to
>>           * avoid bouncing all map/unmaps through qemu this way, this
>> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>>      }
>>  }
>>  
>> +void vfio_listener_register(VFIOContainer *container)
>> +{
>> +    container->listener = vfio_memory_listener;
>> +    memory_listener_register(&container->listener, container->space->as);
>> +}
>> +
>>  static struct vfio_info_cap_header *
>>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>>  {
>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>  {
>>      int iommu_type, ret;
>>  
>> +    /*
>> +     * If container is reused, just set its type and skip the ioctls, as the
>> +     * container and group are already configured in the kernel.
>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>> +     * If you ever add new types or spapr cpr support, kind reader, please
>> +     * also implement VFIO_GET_IOMMU.
>> +     */
> 
> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> problem is that vfio_iommu_type1_check_extension() should actually base
> some of the details on the instantiated vfio_iommu, ex.
> 
> 	switch (arg) {
> 	case VFIO_TYPE1_IOMMU:
> 		return (iommu && iommu->v2) ? 0 : 1;
> 	case VFIO_UNMAP_ALL:
> 	case VFIO_UPDATE_VADDR:
> 	case VFIO_TYPE1v2_IOMMU:
> 		return (iommu && !iommu->v2) ? 0 : 1;
> 	case VFIO_TYPE1_NESTING_IOMMU:
> 		return (iommu && !iommu->nesting) ? 0 : 1;
> 	...
> 
> We can't support v1 if we've already set a v2 container and vice versa.
> There are probably some corner cases and compatibility to puzzle
> through, but I wouldn't think we need a new ioctl to check this.

That change makes sense, and may be worthwhile on its own merits, but does not
solve the problem, which is that qemu will not be able to infer iommu_type in
the future if new types are added.  Given:
  * a new kernel supporting shiny new TYPE1v3
  * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
    knowledge of v3
  * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.

Then the new qemu has no way to infer iommu_type.  If it has code that makes 
decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.

For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
"what are you" about a container feels like a bug to me.
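
To make the failure mode concrete, here is a minimal self-contained sketch of the preference-ordered probe that vfio_get_iommu_type performs. TYPE1 and TYPE1v2 mirror the real UAPI values, but TYPE1v3, its value, and the helper names are purely hypothetical illustrations of the scenario above:

```c
#include <assert.h>
#include <stddef.h>

/* TYPE1 and TYPE1v2 use the real UAPI values; TYPE1v3 is the
 * hypothetical future type from the discussion above. */
enum { TYPE1_IOMMU = 1, TYPE1v2_IOMMU = 3, TYPE1v3_IOMMU = 99 };

/* Models vfio_get_iommu_type: walk a preference-ordered list and
 * return the first type the kernel claims to support. */
static int pick_iommu_type(const int *prefs, size_t n,
                           int (*kernel_supports)(int))
{
    for (size_t i = 0; i < n; i++) {
        if (kernel_supports(prefs[i])) {
            return prefs[i];
        }
    }
    return -1;
}

/* A new kernel that supports both TYPE1v2 and the new TYPE1v3. */
static int new_kernel(int type)
{
    return type == TYPE1v2_IOMMU || type == TYPE1v3_IOMMU;
}
```

Old qemu, knowing only {v2, v1}, selects TYPE1v2 against this kernel; updated qemu lists v3 first and would select TYPE1v3 for a fresh container, so re-running the probe after live update cannot reconstruct what the reused container actually is.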

>> +    if (container->reused) {
>> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
>> +        return 0;
>> +    }
>> +
>>      iommu_type = vfio_get_iommu_type(container, errp);
>>      if (iommu_type < 0) {
>>          return iommu_type;
>> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    reused = (fd >= 0);
>>  
>>      /*
>>       * VFIO is currently incompatible with discarding of RAM insofar as the
>> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       * details once we know which type of IOMMU we are using.
>>       */
>>  
>> +    /*
>> +     * If the container is reused, then the group is already attached in the
>> +     * kernel.  If a container with matching fd is found, then update the
>> +     * userland group list and return.  It not, then after the loop, create
> 
> s/It/If/

Check.

>> +     * the container struct and group list.
>> +     */
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +        if ((reused && container->fd == fd) ||
>> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> 
> 
> We can have multiple containers, so this can still call the ioctl when
> reused = true.  I think it still works, but it's a bit ugly, we're
> relying on the ioctl failing when the container is already set for the
> group.  Does this need to be something like:
> 
>         if (reused) {
>             if (container->fd != fd) {
>                 continue;
>             }
>         } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>             continue;
>         }

Better, thanks.

>>              ret = vfio_ram_block_discard_disable(container, true);
>>              if (ret) {
>>                  error_setg_errno(errp, -ret,
>> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>              }
>>              group->container = container;
>>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> -            vfio_kvm_device_add_group(group);
>> +            if (!reused) {
>> +                vfio_kvm_device_add_group(group);
>> +                cpr_save_fd("vfio_container_for_group", group->groupid,
>> +                            container->fd);
>> +            }
>>              return 0;
>>          }
>>      }
>>  
>> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    if (!reused) {
>> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    }
>> +
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>>          ret = -errno;
>> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      container = g_malloc0(sizeof(*container));
>>      container->space = space;
>>      container->fd = fd;
>> +    container->reused = reused;
>>      container->error = NULL;
>>      container->dirty_pages_supported = false;
>>      container->dma_max_mappings = 0;
>> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      group->container = container;
>>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>  
>> -    container->listener = vfio_memory_listener;
>> -
>> -    memory_listener_register(&container->listener, container->space->as);
>> +    /*
>> +     * If reused, register the listener later, after all state that may
>> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
>> +     * the listener will invoke its callback on each flat section and call
>> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
>> +     * mappings remembered by the kernel.
>> +     */
>> +    if (!reused) {
>> +        vfio_listener_register(container);
>> +    }
>>  
>>      if (container->error) {
>>          ret = -1;
>> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      }
>>  
>>      container->initialized = true;
>> +    if (!reused) {
>> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
>> +    }
>>  
>>      return 0;
>>  listener_release_exit:
>> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  
>>      QLIST_REMOVE(group, container_next);
>>      group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
> 
> Did you consider having cpr_save_fd() do a find_name() and update/no-op
> if found so that we can casually call cpr_save_fd() without nesting it
> in a branch the same as done for cpr_delete_fd()?

I did, but decided it was better to keep cpr_state simple and let higher layers decide
whether the extra search is helpful.  At some call sites, the cpr_save_fd call is already
in a conditional scope with other work.  And, in vfio, the "if (reused)" test is slightly
more efficient than find_name.  Small reasons, but justified IMO.

>>      /*
>>       * Explicitly release the listener first before unset container,
>> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      VFIOGroup *group;
>>      char path[32];
>>      struct vfio_group_status status = { .argsz = sizeof(status) };
>> +    bool reused;
>>  
>>      QLIST_FOREACH(group, &vfio_group_list, next) {
>>          if (group->groupid == groupid) {
>> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      group = g_malloc0(sizeof(*group));
>>  
>>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open_old(path, O_RDWR);
>> +
>> +    group->fd = cpr_find_fd("vfio_group", groupid);
>> +    reused = (group->fd >= 0);
>> +    if (!reused) {
>> +        group->fd = qemu_open_old(path, O_RDWR);
>> +    }
>> +
>>      if (group->fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open %s", path);
>>          goto free_group_exit;
>> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>  
>>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>>  
>> +    if (!reused) {
>> +        cpr_save_fd("vfio_group", groupid, group->fd);
>> +    }
> 
> If cpr_save_fd() were idempotent as above, we wouldn't need the reused
> variable here and the previous chunk could be simplified.  It might
> even suggest a function like "cpr_find_or_open_fd()".

Sure, I will add a new function.
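
A possible shape for that helper, sketched here against a toy in-memory registry rather than the real cpr state in migration/cpr.c (everything other than the cpr_find_fd/cpr_save_fd names is an assumption for illustration):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for the cpr fd registry: a small name+id -> fd table. */
#define CPR_MAX 16
static struct { char name[32]; int id; int fd; } cpr_tab[CPR_MAX];
static int cpr_n;

static int cpr_find_fd(const char *name, int id)
{
    for (int i = 0; i < cpr_n; i++) {
        if (cpr_tab[i].id == id && strcmp(cpr_tab[i].name, name) == 0) {
            return cpr_tab[i].fd;
        }
    }
    return -1;
}

static void cpr_save_fd(const char *name, int id, int fd)
{
    if (cpr_n < CPR_MAX) {
        strncpy(cpr_tab[cpr_n].name, name, sizeof(cpr_tab[cpr_n].name) - 1);
        cpr_tab[cpr_n].id = id;
        cpr_tab[cpr_n].fd = fd;
        cpr_n++;
    }
}

/* The suggested helper: reuse a saved fd if one exists, else open the
 * path and record the new fd, so callers no longer need to wrap
 * cpr_save_fd in an "if (reused)" branch. */
static int cpr_find_or_open_fd(const char *name, int id,
                               const char *path, int flags, bool *reused)
{
    int fd = cpr_find_fd(name, id);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open(path, flags);
        if (fd >= 0) {
            cpr_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

The first call for a given name/id opens and records the fd; a second call finds the saved fd and reports reused = true, which matches the pattern at the vfio_get_group and vfio_get_device call sites.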

>> +
>>      return group;
>>  
>>  close_fd_exit:
>> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>>      vfio_disconnect_container(group);
>>      QLIST_REMOVE(group, next);
>>      trace_vfio_put_group(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>      close(group->fd);
>>      g_free(group);
>>  
>> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, fd;
>> +    bool reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    reused = (fd >= 0);
>> +    if (!reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>>  
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "error getting device from group %d",
>>                           group->groupid);
>> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>      vbasedev->num_irqs = dev_info.num_irqs;
>>      vbasedev->num_regions = dev_info.num_regions;
>>      vbasedev->flags = dev_info.flags;
>> +    vbasedev->reused = reused;
>> +    if (!reused) {
>> +        cpr_save_fd(name, 0, fd);
>> +    }
> 
> Another cleanup here if we didn't need to tiptoe around cpr_save_fd().

Yes, I will find and fix all such callsites.

>>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>>                            dev_info.num_irqs);
>> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>      QLIST_REMOVE(vbasedev, next);
>>      vbasedev->group = NULL;
>>      trace_vfio_put_base_device(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>      close(vbasedev->fd);
>>  }
>>  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 0000000..2c39cd5
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,94 @@
>> +/*
>> + * Copyright (c) 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
> 
> We could have minimally used this where we assumed a TYPE1v2 container.

Are you referring to vfio_init_container (discussed above)?
Are you suggesting that, if reused is true, we validate that those extensions are
present before setting iommu_type = VFIO_TYPE1v2_IOMMU?

>> +
>> +/*
>> + * Verify that all containers support CPR, and unmap all dma vaddr's.
>> + */
>> +int vfio_cpr_save(Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
> 
> Seems like we ought to validate all containers support CPR before we
> start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
> replay the listeners to remap the vaddrs in case of an error?

Already done.  I refactored that code into a separate patch to tease out some
of the complexity:
  vfio-pci: recover from unmap-all-vaddr failure

>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Register the listener for each container, which causes its callback to be
>> + * invoked for every flat section.  The callback will see that the container
>> + * is reused, and call vfio_dma_map with the new vaddr.
>> + */
>> +int vfio_cpr_load(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            vfio_listener_register(container);
>> +            container->reused = false;
>> +        }
>> +    }
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index a90cce2..acac8a7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -30,6 +30,7 @@
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>>  #include "qapi/qmp/qdict.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          vfio_put_group(group);
>>          goto error;
>>      }
>> +    pdev->reused = vdev->vbasedev.reused;
>>  
>>      vfio_populate_device(vdev, &err);
>>      if (err) {
>> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>>  
>> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
>> +    if (vdev->pdev.reused) {
>> +        return;
>> +    }
>> +
>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>  
>>      vfio_pci_pre_reset(vdev);
>> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void vfio_merge_config(VFIOPCIDevice *vdev)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    g_autofree uint8_t *phys_config = g_malloc(size);
>> +    uint32_t mask;
>> +    int ret, i;
>> +
>> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
>> +    if (ret < size) {
>> +        ret = ret < 0 ? errno : EFAULT;
>> +        error_report("failed to read device config space: %s", strerror(ret));
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < size; i++) {
>> +        mask = vdev->emulated_config_bits[i];
>> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
>> +    }
>> +}
> 
> IIUC, we get a copy of config space from the vfio device and for each
> byte, we keep what we have in emulated config space for emulated bits
> and fill in from the device for non-emulated bits.  Meanwhile,
> vfio_pci_read_config() doesn't ever return non-emulated bits from
> emulated config space, so what specifically are we accomplishing here?

Nothing, apparently.  I speculated that pdev->config[] could be used as a cache 
for the kernel config for non-emulated bits.  I'll delete it.

>> +
>> +/*
>> + * The kernel may change non-emulated config bits.  Exclude them from the
>> + * changed-bits check in get_pci_config_device.
>> + */
>> +static int vfio_pci_pre_load(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    int i;
>> +
>> +    for (i = 0; i < size; i++) {
>> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
>> +    }
>> +
>> +    return 0;
>> +}
> 
> The previous function seemed like maybe an attempt to make non-emulated
> bits in emulated config space consistent for testing, but here we're
> masking all non-emulated bits out of that mask.  Why do we need to do
> both?

We do need this one, as I was triggering errors in get_pci_config_device.
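
For readers following along, the effect can be shown with a self-contained model of that changed-bits check (the helper names here are illustrative; the real check lives in get_pci_config_device):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the changed-bits check in get_pci_config_device: a byte fails
 * if it differs from the saved config in any bit covered by cmask. */
static int config_check_fails(const uint8_t *cfg, const uint8_t *saved,
                              const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((cfg[i] ^ saved[i]) & cmask[i]) {
            return 1;
        }
    }
    return 0;
}

/* vfio_pci_pre_load's fix: restrict cmask to emulated bits so that
 * kernel-owned (non-emulated) bits may differ without failing the load. */
static void mask_to_emulated(uint8_t *cmask, const uint8_t *emulated,
                             size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}
```

With a byte whose high bits were changed by the kernel, the unmasked check rejects the incoming state; after restricting cmask to the emulated bits, only qemu-owned bits participate and the load succeeds.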

>> +
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vfio_merge_config(vdev);
>> +
>> +    pdev->reused = false;
>> +
>> +    return 0;
>> +}
>> +
>> +static bool vfio_pci_needed(void *opaque)
>> +{
>> +    return cpr_get_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> +    .name = "vfio-pci",
>> +    .unmigratable = 1,
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .pre_load = vfio_pci_pre_load,
>> +    .post_load = vfio_pci_post_load,
>> +    .needed = vfio_pci_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->reset = vfio_pci_reset;
>>      device_class_set_props(dc, vfio_pci_dev_properties);
>> +    dc->vmsd = &vfio_pci_vmstate;
>>      dc->desc = "VFIO-based PCI device assignment";
>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>      pdc->realize = vfio_realize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 0ef1b5f..63dd0fe 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>>  
>>  # platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index cc63dd4..8557e82 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -361,6 +361,7 @@ struct PCIDevice {
>>      /* ID of standby device in net_failover pair */
>>      char *failover_pair_id;
>>      uint32_t acpi_index;
>> +    bool reused;
>>  };
>>  
>>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 1641753..bc23c29 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>> +    bool reused;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>>      bool no_mmap;
>>      bool ram_block_discard_allowed;
>>      bool enable_migration;
>> +    bool reused;
>>      VFIODeviceOps *ops;
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>  void vfio_put_group(VFIOGroup *group);
>>  int vfio_get_device(VFIOGroup *group, const char *name,
>>                      VFIODevice *vbasedev, Error **errp);
>> +int vfio_cpr_save(Error **errp);
>> +int vfio_cpr_load(Error **errp);
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>>  
>>  extern const MemoryRegionOps vfio_region_ops;
>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>>  #endif
>>  extern const MemoryListener vfio_prereg_listener;
>> +void vfio_listener_register(VFIOContainer *container);
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section);
>>  
>>  int vfio_spapr_create_window(VFIOContainer *container,
>>                               MemoryRegionSection *section,
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index a4da24e..a4007cf 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>>  int cpr_state_load(Error **errp);
>>  void cpr_state_print(void);
>>  
>> +int cpr_vfio_save(Error **errp);
>> +int cpr_vfio_load(Error **errp);
>> +
>>  #endif
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 37eca66..cee82cf 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -7,6 +7,7 @@
>>  
>>  #include "qemu/osdep.h"
>>  #include "exec/memory.h"
>> +#include "hw/vfio/vfio-common.h"
>>  #include "io/channel-buffer.h"
>>  #include "io/channel-file.h"
>>  #include "migration.h"
>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>          return;
>>      }
>> -
>> +    if (cpr_vfio_save(errp)) {
>> +        return;
>> +    }
> 
> Why is vfio so unique that it needs separate handlers versus other
> devices?  Thanks,

In earlier patches these functions fiddled with more objects, but at this point
they are simple enough to convert to pre_save and post_load vmstate handlers for
the container and group objects.  However, we would still need to call special 
functions for vfio from qmp_cpr_exec:

  * validate all containers support CPR before we start blasting vaddrs
    However, I could validate them all in every call to pre_save for each container.
    That would be less efficient, but fits the vmstate model.

  * restore all vaddr's if qemu_save_device_state fails.
    However, I could recover for all containers inside pre_save when one container fails.
    Feels strange touching all objects in a function for one, but there is no real
    downside.

Currently the logic for both of those is in vfio_cpr_save (in the patch 
"vfio-pci: recover from unmap-all-vaddr failure").

So, convert to pre_save and post_load handlers or not?

- Steve

>>      cpr_walk_fd(preserve_fd, 0);
>>      if (cpr_state_save(errp)) {
>>          return;
>> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>>          goto out;
>>      }
>>  
>> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
>> +        cpr_vfio_load(errp)) {
>> +        goto out;
>> +    }
>> +
>>      state = global_state_get_runstate();
>>      if (state == RUN_STATE_RUNNING) {
>>          vm_start();
>> diff --git a/migration/target.c b/migration/target.c
>> index 4390bf0..984bc9e 100644
>> --- a/migration/target.c
>> +++ b/migration/target.c
>> @@ -8,6 +8,7 @@
>>  #include "qemu/osdep.h"
>>  #include "qapi/qapi-types-migration.h"
>>  #include "migration.h"
>> +#include "migration/cpr.h"
>>  #include CONFIG_DEVICES
>>  
>>  #ifdef CONFIG_VFIO
>> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>>          info->vfio->transferred = vfio_mig_bytes_transferred();
>>      }
>>  }
>> +
>> +int cpr_vfio_save(Error **errp)
>> +{
>> +    return vfio_cpr_save(errp);
>> +}
>> +
>> +int cpr_vfio_load(Error **errp)
>> +{
>> +    return vfio_cpr_load(errp);
>> +}
>> +
>>  #else
>>  
>>  void populate_vfio_info(MigrationInfo *info) {}
>> +int cpr_vfio_save(Error **errp) { return 0; }
>> +int cpr_vfio_load(Error **errp) { return 0; }
>>  
>>  #endif /* CONFIG_VFIO */
>
Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Alex Williamson 3 years, 11 months ago
On Thu, 10 Mar 2022 10:00:29 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/7/2022 5:16 PM, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 11:05:24 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>  {
> >>      int iommu_type, ret;
> >>  
> >> +    /*
> >> +     * If container is reused, just set its type and skip the ioctls, as the
> >> +     * container and group are already configured in the kernel.
> >> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >> +     * If you ever add new types or spapr cpr support, kind reader, please
> >> +     * also implement VFIO_GET_IOMMU.
> >> +     */  
> > 
> > VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> > problem is that vfio_iommu_type1_check_extension() should actually base
> > some of the details on the instantiated vfio_iommu, ex.
> > 
> > 	switch (arg) {
> > 	case VFIO_TYPE1_IOMMU:
> > 		return (iommu && iommu->v2) ? 0 : 1;
> > 	case VFIO_UNMAP_ALL:
> > 	case VFIO_UPDATE_VADDR:
> > 	case VFIO_TYPE1v2_IOMMU:
> > 		return (iommu && !iommu->v2) ? 0 : 1;
> > 	case VFIO_TYPE1_NESTING_IOMMU:
> > 		return (iommu && !iommu->nesting) ? 0 : 1;
> > 	...
> > 
> > We can't support v1 if we've already set a v2 container and vice versa.
> > There are probably some corner cases and compatibility to puzzle
> > through, but I wouldn't think we need a new ioctl to check this.  
> 
> That change makes sense, and may be worthwhile on its own merits, but does not
> solve the problem, which is that qemu will not be able to infer iommu_type in
> the future if new types are added.  Given:
>   * a new kernel supporting shiny new TYPE1v3
>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
>     knowledge of v3
>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
> 
> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
> 
> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
> "what are you" about a container feels like a bug to me.

Hmm, I don't think the kernel has an innate responsibility to remind
the user of a configuration that they've already made.  But I also
don't follow your TYPE1v3 example.  If we added such a type, I imagine
the switch would change to:

	switch (arg) {
	case VFIO_TYPE1_IOMMU:
		return (iommu && (iommu->v2 || iommu->v3)) ? 0 : 1;
	case VFIO_UNMAP_ALL:
	case VFIO_UPDATE_VADDR:
		return (iommu && !(iommu->v2 || iommu->v3)) ? 0 : 1;
	case VFIO_TYPE1v2_IOMMU:
		return (iommu && !iommu->v2) ? 0 : 1;
	case VFIO_TYPE1v3_IOMMU:
		return (iommu && !iommu->v3) ? 0 : 1;
	...

How would that not allow exactly the scenario described, ie. new QEMU
can see that old QEMU left it a v2 IOMMU.
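[Editor's note: to make the probing idea concrete, here is a self-contained sketch of the QEMU-side inference under the kernel behavior sketched above. All names and extension values are hypothetical stand-ins; `check()` models `ioctl(container_fd, VFIO_CHECK_EXTENSION, ext)`.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the VFIO_TYPE1*_IOMMU extension IDs. */
enum { TYPE1v3_IOMMU = 8, TYPE1v2_IOMMU = 3, TYPE1_IOMMU = 1, NO_IOMMU = -1 };

/* check(ext) models ioctl(container_fd, VFIO_CHECK_EXTENSION, ext). */
typedef bool (*check_fn)(int ext);

/*
 * Probe newest-first, as vfio_get_iommu_type() does.  A fresh container
 * reports every extension as supported, so the newest type wins; a reused
 * container reports only the type the old QEMU already set, so the new
 * QEMU recovers it without a separate VFIO_GET_IOMMU ioctl.
 */
static int infer_iommu_type(check_fn check)
{
    static const int candidates[] = { TYPE1v3_IOMMU, TYPE1v2_IOMMU, TYPE1_IOMMU };

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (check(candidates[i])) {
            return candidates[i];
        }
    }
    return NO_IOMMU;
}
```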

...
> >> +
> >> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> >> +{
> >> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> >> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> >> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> >> +                         "or VFIO_UNMAP_ALL");
> >> +        return false;
> >> +    } else {
> >> +        return true;
> >> +    }
> >> +}  
> > 
> > We could have minimally used this where we assumed a TYPE1v2 container.  
> 
> Are you referring to vfio_init_container (discussed above)?
> Are you suggesting that, if reused is true, we validate those extensions are
> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

Yeah, though maybe it's not sufficiently precise to be worthwhile given
the current kernel behavior.

> >> +
> >> +/*
> >> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> >> + */
> >> +int vfio_cpr_save(Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +    VFIOAddressSpace *space;
> >> +    VFIOContainer *container;
> >> +
> >> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> >> +        QLIST_FOREACH(container, &space->containers, next) {
> >> +            if (!vfio_is_cpr_capable(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +        }
> >> +    }  
> > 
> > Seems like we ought to validate all containers support CPR before we
> > start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> > this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
> > replay the listeners to remap the vaddrs in case of an error?  
> 
> Already done.  I refactored that code into a separate patch to tease out some
> of the complexity:
>   vfio-pci: recover from unmap-all-vaddr failure

Sorry, didn't get to that one til after I'd sent comments here.

...
> >> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> >> index a4da24e..a4007cf 100644
> >> --- a/include/migration/cpr.h
> >> +++ b/include/migration/cpr.h
> >> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
> >>  int cpr_state_load(Error **errp);
> >>  void cpr_state_print(void);
> >>  
> >> +int cpr_vfio_save(Error **errp);
> >> +int cpr_vfio_load(Error **errp);
> >> +
> >>  #endif
> >> diff --git a/migration/cpr.c b/migration/cpr.c
> >> index 37eca66..cee82cf 100644
> >> --- a/migration/cpr.c
> >> +++ b/migration/cpr.c
> >> @@ -7,6 +7,7 @@
> >>  
> >>  #include "qemu/osdep.h"
> >>  #include "exec/memory.h"
> >> +#include "hw/vfio/vfio-common.h"
> >>  #include "io/channel-buffer.h"
> >>  #include "io/channel-file.h"
> >>  #include "migration.h"
> >> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
> >>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> >>          return;
> >>      }
> >> -
> >> +    if (cpr_vfio_save(errp)) {
> >> +        return;
> >> +    }  
> > 
> > Why is vfio so unique that it needs separate handlers versus other
> > devices?  Thanks,  
> 
> In earlier patches these functions fiddled with more objects, but at this point
> they are simple enough to convert to pre_save and post_load vmstate handlers for
> the container and group objects.  However, we would still need to call special 
> functions for vfio from qmp_cpr_exec:
> 
>   * validate all containers support CPR before we start blasting vaddrs
>     However, I could validate all, in every call to pre_save for each container.
>     That would be less efficient, but fits the vmstate model.

Would it be a better option to mirror the migration blocker support, ie.
any device that doesn't support cpr registers a blocker and generic
code only needs to keep track of whether any blockers are registered.

>   * restore all vaddr's if qemu_save_device_state fails.
>     However, I could recover for all containers inside pre_save when one container fails.
>     Feels strange touching all objects in a function for one, but there is no real
>     downside.

I'm not as familiar as I should be with migration callbacks, thanks to
mostly not supporting it for vfio devices, but it seems strange to me
that there's no existing callback or notifier per device to propagate
save failure.  Do we not at least get some sort of resume callback in
that case?

As an alternative, maybe each container could register a vm change
handler that would trigger reloading vaddrs if we move to a running
state and a flag on the container indicates vaddrs were invalidated?
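[Editor's note: a minimal, self-contained model of this alternative. In QEMU the callback would be registered with qemu_add_vm_change_state_handler() and the remap would re-walk the listener ranges with VFIO_DMA_MAP_FLAG_VADDR; the struct and function names below are illustrative, not QEMU's.]

```c
#include <assert.h>
#include <stdbool.h>

/* Model container: remembers that its vaddrs were invalidated for cpr. */
typedef struct {
    bool vaddr_unmapped;   /* set when vaddrs were invalidated */
    int  remaps;           /* counts recovery remaps, for illustration */
} ContainerModel;

/* vm change handler: remap vaddrs on the transition back to running. */
static void vfio_vm_state_change(void *opaque, bool running)
{
    ContainerModel *c = opaque;

    if (running && c->vaddr_unmapped) {
        c->remaps++;               /* would remap all vaddrs here */
        c->vaddr_unmapped = false; /* happy path stays a cheap flag test */
    }
}
```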
Thanks,

Alex
Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 3 years, 11 months ago
On 3/10/2022 1:35 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 10:00:29 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/7/2022 5:16 PM, Alex Williamson wrote:
>>> On Wed, 22 Dec 2021 11:05:24 -0800
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>>>  {
>>>>      int iommu_type, ret;
>>>>  
>>>> +    /*
>>>> +     * If container is reused, just set its type and skip the ioctls, as the
>>>> +     * container and group are already configured in the kernel.
>>>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>>>> +     * If you ever add new types or spapr cpr support, kind reader, please
>>>> +     * also implement VFIO_GET_IOMMU.
>>>> +     */  
>>>
>>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
>>> problem is that vfio_iommu_type1_check_extension() should actually base
>>> some of the details on the instantiated vfio_iommu, ex.
>>>
>>> 	switch (arg) {
>>> 	case VFIO_TYPE1_IOMMU:
>>> 		return (iommu && iommu->v2) ? 0 : 1;
>>> 	case VFIO_UNMAP_ALL:
>>> 	case VFIO_UPDATE_VADDR:
>>> 	case VFIO_TYPE1v2_IOMMU:
>>> 		return (iommu && !iommu->v2) ? 0 : 1;
>>> 	case VFIO_TYPE1_NESTING_IOMMU:
>>> 		return (iommu && !iommu->nesting) ? 0 : 1;
>>> 	...
>>>
>>> We can't support v1 if we've already set a v2 container and vice versa.
>>> There are probably some corner cases and compatibility to puzzle
>>> through, but I wouldn't think we need a new ioctl to check this.  
>>
>> That change makes sense, and may be worthwhile on its own merits, but does not
>> solve the problem, which is that qemu will not be able to infer iommu_type in
>> the future if new types are added.  Given:
>>   * a new kernel supporting shiny new TYPE1v3
>>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
>>     knowledge of v3
>>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
>>
>> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
>> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
>> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
>>
>> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
>> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
>> "what are you" about a container feels like a bug to me.
> 
> Hmm, I don't think the kernel has an innate responsibility to remind
> the user of a configuration that they've already made.  

No, but it can make userland cleaner.  For example, CRIU checkpoint/restart queries
the kernel to save process state, and later makes syscalls to restore it.  Where the
kernel does not export sufficient information, CRIU must provide interpose libraries
so it can remember state internally on its way to the kernel.  And applications must
link against the interpose libraries.

> But I also
> don't follow your TYPE1v3 example.  If we added such a type, I imagine
> the switch would change to:
> 
> 	switch (arg) {
> 	case VFIO_TYPE1_IOMMU:
> 		return (iommu && (iommu->v2 || iommu->v3)) ? 0 : 1;
> 	case VFIO_UNMAP_ALL:
> 	case VFIO_UPDATE_VADDR:
> 		return (iommu && !(iommu->v2 || iommu->v3)) ? 0 : 1;
> 	case VFIO_TYPE1v2_IOMMU:
> 		return (iommu && !iommu->v2) ? 0 : 1;
> 	case VFIO_TYPE1v3_IOMMU:
> 		return (iommu && !iommu->v3) ? 0 : 1;
> 	...
> 
> How would that not allow exactly the scenario described, ie. new QEMU
> can see that old QEMU left it a v2 IOMMU.

OK, that works as long as the switch returns true for all options before
VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
which I missed before.  If we are on the same page now, I will modify my
comment "please also implement VFIO_GET_IOMMU".

> ...
>>>> +
>>>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>>>> +{
>>>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>>>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>>>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>>>> +                         "or VFIO_UNMAP_ALL");
>>>> +        return false;
>>>> +    } else {
>>>> +        return true;
>>>> +    }
>>>> +}  
>>>
>>> We could have minimally used this where we assumed a TYPE1v2 container.  
>>
>> Are you referring to vfio_init_container (discussed above)?
>> Are you suggesting that, if reused is true, we validate those extensions are
>> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?
> 
> Yeah, though maybe it's not sufficiently precise to be worthwhile given
> the current kernel behavior.
> 
>>>> +
>>>> +/*
>>>> + * Verify that all containers support CPR, and unmap all dma vaddr's.
>>>> + */
>>>> +int vfio_cpr_save(Error **errp)
>>>> +{
>>>> +    ERRP_GUARD();
>>>> +    VFIOAddressSpace *space;
>>>> +    VFIOContainer *container;
>>>> +
>>>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>>>> +        QLIST_FOREACH(container, &space->containers, next) {
>>>> +            if (!vfio_is_cpr_capable(container, errp)) {
>>>> +                return -1;
>>>> +            }
>>>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>>>> +                return -1;
>>>> +            }
>>>> +        }
>>>> +    }  
>>>
>>> Seems like we ought to validate all containers support CPR before we
>>> start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
>>> this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
>>> replay the listeners to remap the vaddrs in case of an error?  
>>
>> Already done.  I refactored that code into a separate patch to tease out some
>> of the complexity:
>>   vfio-pci: recover from unmap-all-vaddr failure
> 
> Sorry, didn't get to that one til after I'd sent comments here.
> 
> ...
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index a4da24e..a4007cf 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>>>>  int cpr_state_load(Error **errp);
>>>>  void cpr_state_print(void);
>>>>  
>>>> +int cpr_vfio_save(Error **errp);
>>>> +int cpr_vfio_load(Error **errp);
>>>> +
>>>>  #endif
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 37eca66..cee82cf 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -7,6 +7,7 @@
>>>>  
>>>>  #include "qemu/osdep.h"
>>>>  #include "exec/memory.h"
>>>> +#include "hw/vfio/vfio-common.h"
>>>>  #include "io/channel-buffer.h"
>>>>  #include "io/channel-file.h"
>>>>  #include "migration.h"
>>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>>>          return;
>>>>      }
>>>> -
>>>> +    if (cpr_vfio_save(errp)) {
>>>> +        return;
>>>> +    }  
>>>
>>> Why is vfio so unique that it needs separate handlers versus other
>>> devices?  Thanks,  
>>
>> In earlier patches these functions fiddled with more objects, but at this point
>> they are simple enough to convert to pre_save and post_load vmstate handlers for
>> the container and group objects.  However, we would still need to call special 
>> functions for vfio from qmp_cpr_exec:
>>
>>   * validate all containers support CPR before we start blasting vaddrs
>>     However, I could validate all, in every call to pre_save for each container.
>>     That would be less efficient, but fits the vmstate model.
> 
> Would it be a better option to mirror the migration blocker support, ie.
> any device that doesn't support cpr registers a blocker and generic
> code only needs to keep track of whether any blockers are registered.

We cannot specifically use migrate_add_blocker(), because it is checked in
the migration specific function migrate_prepare(), in a layer of functions 
above the simpler qemu_save_device_state() used in cpr.  But yes, we could
do something similar for vfio.  Increment a global counter in vfio_realize
if the container does not support cpr, and decrement it when the container is
destroyed.  pre_save could just check the counter.
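[Editor's note: a minimal sketch of that counter scheme, mirroring migrate_add_blocker() in spirit only. None of these names exist in QEMU; vfio_realize() would bump the counter for a container lacking VFIO_UPDATE_VADDR/VFIO_UNMAP_ALL, and container teardown would drop it again.]

```c
#include <assert.h>

/* Hypothetical global cpr-blocker count. */
static int cpr_blockers;

static void cpr_add_blocker(void) { cpr_blockers++; }
static void cpr_del_blocker(void) { cpr_blockers--; }

/* Generic pre_save check: refuse cpr-save while any blocker is registered. */
static int cpr_pre_save(void)
{
    return cpr_blockers ? -1 : 0;
}
```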

>>   * restore all vaddr's if qemu_save_device_state fails.
>>     However, I could recover for all containers inside pre_save when one container fails.
>>     Feels strange touching all objects in a function for one, but there is no real
>>     downside.
> 
> I'm not as familiar as I should be with migration callbacks, thanks to
> mostly not supporting it for vfio devices, but it seems strange to me
> that there's no existing callback or notifier per device to propagate
> save failure.  Do we not at least get some sort of resume callback in
> that case?

We do not:
    struct VMStateDescription {
        int (*pre_load)(void *opaque);
        int (*post_load)(void *opaque, int version_id);
        int (*pre_save)(void *opaque);
        int (*post_save)(void *opaque);

The handler returns an error, which stops further saves and is propagated back
to the top level caller qemu_save_device_state().

The vast majority of handlers do not have side effects, with no need to unwind 
anything on failure.

This raises another point.  If pre_save succeeds for all the containers,
but fails for some non-vfio object, then the overall operation is abandoned,
but we do not restore the vaddr's.  To plug that hole, we need to call the
unwind code from qmp_cpr_save, or implement your alternative below.

> As an alternative, maybe each container could register a vm change
> handler that would trigger reloading vaddrs if we move to a running
> state and a flag on the container indicates vaddrs were invalidated?
> Thanks,

That works and is modular, but I dislike that it adds checks on the
happy path for a case that will rarely happen, and it pushes recovery from
failure further away from the original failure, which would make debugging
cascading failures more difficult.

- Steve
Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Alex Williamson 3 years, 11 months ago
On Thu, 10 Mar 2022 14:55:50 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/10/2022 1:35 PM, Alex Williamson wrote:
> > On Thu, 10 Mar 2022 10:00:29 -0500
> > Steven Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
> >>> On Wed, 22 Dec 2021 11:05:24 -0800
> >>> Steve Sistare <steven.sistare@oracle.com> wrote:  
> >>>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>>>  {
> >>>>      int iommu_type, ret;
> >>>>  
> >>>> +    /*
> >>>> +     * If container is reused, just set its type and skip the ioctls, as the
> >>>> +     * container and group are already configured in the kernel.
> >>>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >>>> +     * If you ever add new types or spapr cpr support, kind reader, please
> >>>> +     * also implement VFIO_GET_IOMMU.
> >>>> +     */    
> >>>
> >>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> >>> problem is that vfio_iommu_type1_check_extension() should actually base
> >>> some of the details on the instantiated vfio_iommu, ex.
> >>>
> >>> 	switch (arg) {
> >>> 	case VFIO_TYPE1_IOMMU:
> >>> 		return (iommu && iommu->v2) ? 0 : 1;
> >>> 	case VFIO_UNMAP_ALL:
> >>> 	case VFIO_UPDATE_VADDR:
> >>> 	case VFIO_TYPE1v2_IOMMU:
> >>> 		return (iommu && !iommu->v2) ? 0 : 1;
> >>> 	case VFIO_TYPE1_NESTING_IOMMU:
> >>> 		return (iommu && !iommu->nesting) ? 0 : 1;
> >>> 	...
> >>>
> >>> We can't support v1 if we've already set a v2 container and vice versa.
> >>> There are probably some corner cases and compatibility to puzzle
> >>> through, but I wouldn't think we need a new ioctl to check this.    
> >>
> >> That change makes sense, and may be worthwhile on its own merits, but does not
> >> solve the problem, which is that qemu will not be able to infer iommu_type in
> >> the future if new types are added.  Given:
> >>   * a new kernel supporting shiny new TYPE1v3
> >>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
> >>     knowledge of v3
> >>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
> >>
> >> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> >> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
> >> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
> >>
> >> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
> >> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
> >> "what are you" about a container feels like a bug to me.  
> > 
> > Hmm, I don't think the kernel has an innate responsibility to remind
> > the user of a configuration that they've already made.    
> 
> No, but it can make userland cleaner.  For example, CRIU checkpoint/restart queries
> the kernel to save process state, and later makes syscalls to restore it.  Where the
> kernel does not export sufficient information, CRIU must provide interpose libraries
> so it can remember state internally on its way to the kernel.  And applications must
> link against the interpose libraries.

The counter argument is that it bloats the kernel to add interfaces to
report back things that userspace should already know.  Which has more
exploit vectors, a new kernel ioctl or yet another userspace library?
 
> > But I also
> > don't follow your TYPE1v3 example.  If we added such a type, I imagine
> > the switch would change to:
> > 
> > 	switch (arg) {
> > 	case VFIO_TYPE1_IOMMU:
> > 		return (iommu && (iommu->v2 || iommu->v3)) ? 0 : 1;
> > 	case VFIO_UNMAP_ALL:
> > 	case VFIO_UPDATE_VADDR:
> > 		return (iommu && !(iommu->v2 || iommu->v3)) ? 0 : 1;
> > 	case VFIO_TYPE1v2_IOMMU:
> > 		return (iommu && !iommu->v2) ? 0 : 1;
> > 	case VFIO_TYPE1v3_IOMMU:
> > 		return (iommu && !iommu->v3) ? 0 : 1;
> > 	...
> > 
> > How would that not allow exactly the scenario described, ie. new QEMU
> > can see that old QEMU left it a v2 IOMMU.  
> 
> OK, that works as long as the switch returns true for all options before
> VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
> which I missed before.  If we are on the same page now, I will modify my
> comment "please also implement VFIO_GET_IOMMU".

Yes, in the above all extensions are supported before the container
type is set, then once set only the relevant extensions are available.

...
> >>>> diff --git a/migration/cpr.c b/migration/cpr.c
> >>>> index 37eca66..cee82cf 100644
> >>>> --- a/migration/cpr.c
> >>>> +++ b/migration/cpr.c
> >>>> @@ -7,6 +7,7 @@
> >>>>  
> >>>>  #include "qemu/osdep.h"
> >>>>  #include "exec/memory.h"
> >>>> +#include "hw/vfio/vfio-common.h"
> >>>>  #include "io/channel-buffer.h"
> >>>>  #include "io/channel-file.h"
> >>>>  #include "migration.h"
> >>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
> >>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> >>>>          return;
> >>>>      }
> >>>> -
> >>>> +    if (cpr_vfio_save(errp)) {
> >>>> +        return;
> >>>> +    }    
> >>>
> >>> Why is vfio so unique that it needs separate handlers versus other
> >>> devices?  Thanks,    
> >>
> >> In earlier patches these functions fiddled with more objects, but at this point
> >> they are simple enough to convert to pre_save and post_load vmstate handlers for
> >> the container and group objects.  However, we would still need to call special 
> >> functions for vfio from qmp_cpr_exec:
> >>
> >>   * validate all containers support CPR before we start blasting vaddrs
> >>     However, I could validate all, in every call to pre_save for each container.
> >>     That would be less efficient, but fits the vmstate model.  
> > 
> > Would it be a better option to mirror the migration blocker support, ie.
> > any device that doesn't support cpr registers a blocker and generic
> > code only needs to keep track of whether any blockers are registered.  
> 
> We cannot specifically use migrate_add_blocker(), because it is checked in
> the migration specific function migrate_prepare(), in a layer of functions 
> above the simpler qemu_save_device_state() used in cpr.  But yes, we could
> do something similar for vfio.  Increment a global counter in vfio_realize
> if the container does not support cpr, and decrement it when the container is
> destroyed.  pre_save could just check the counter.

Right, not suggesting to piggyback on migrate_add_blocker() only to use
a similar mechanism.  Only drivers that can't support cpr need register
a blocker but testing for blockers is done generically, not just for
vfio devices.

> >>   * restore all vaddr's if qemu_save_device_state fails.
> >>     However, I could recover for all containers inside pre_save when one container fails.
> >>     Feels strange touching all objects in a function for one, but there is no real
> >>     downside.  
> > 
> > I'm not as familiar as I should be with migration callbacks, thanks to
> > mostly not supporting it for vfio devices, but it seems strange to me
> > that there's no existing callback or notifier per device to propagate
> > save failure.  Do we not at least get some sort of resume callback in
> > that case?  
> 
> We do not:
>     struct VMStateDescription {
>         int (*pre_load)(void *opaque);
>         int (*post_load)(void *opaque, int version_id);
>         int (*pre_save)(void *opaque);
>         int (*post_save)(void *opaque);
> 
> The handler returns an error, which stops further saves and is propagated back
> to the top level caller qemu_save_device_state().
> 
> The vast majority of handlers do not have side effects, with no need to unwind 
> anything on failure.
> 
> This raises another point.  If pre_save succeeds for all the containers,
> but fails for some non-vfio object, then the overall operation is abandoned,
> but we do not restore the vaddr's.  To plug that hole, we need to call the
> unwind code from qmp_cpr_save, or implement your alternative below.

We're trying to reuse migration interfaces; are we also triggering
migration state change notifiers, ie.
MIGRATION_STATUS_{CANCELLING,CANCELLED,FAILED}?  We already hook vfio
devices supporting migration into that notifier to tell the driver to
move the device back to the running state on failure, which seems a bit
unique to vfio devices.  Containers could maybe register their own
callbacks.

> > As an alternative, maybe each container could register a vm change
> > handler that would trigger reloading vaddrs if we move to a running
> > state and a flag on the container indicates vaddrs were invalidated?
> > Thanks,  
> 
> That works and is modular, but I dislike that it adds checks on the
> happy path for a case that will rarely happen, and it pushes recovery from
> failure further away from the original failure, which would make debugging
> cascading failures more difficult.

Would using the migration notifier move us sufficiently closer to the
failure point?  Otherwise I think you're talking about unwinding all
the containers when any one fails, where you didn't like that object
overreach, or maybe adding an optional callback... but I wonder if the
above notifier essentially already does that.

In any case, I think we have options to either implement new or use
existing notifier-like functionality to avoid all these vfio specific
callouts.  Thanks,

Alex
Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Posted by Steven Sistare 3 years, 11 months ago
On 3/10/2022 5:30 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 14:55:50 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/10/2022 1:35 PM, Alex Williamson wrote:
>>> On Thu, 10 Mar 2022 10:00:29 -0500
>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
>>>>> On Wed, 22 Dec 2021 11:05:24 -0800
>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:  
>>>>> [...]
>>>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>>>> index 37eca66..cee82cf 100644
>>>>>> --- a/migration/cpr.c
>>>>>> +++ b/migration/cpr.c
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  
>>>>>>  #include "qemu/osdep.h"
>>>>>>  #include "exec/memory.h"
>>>>>> +#include "hw/vfio/vfio-common.h"
>>>>>>  #include "io/channel-buffer.h"
>>>>>>  #include "io/channel-file.h"
>>>>>>  #include "migration.h"
>>>>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>>>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>>>>>          return;
>>>>>>      }
>>>>>> -
>>>>>> +    if (cpr_vfio_save(errp)) {
>>>>>> +        return;
>>>>>> +    }    
>>>>>
>>>>> Why is vfio so unique that it needs separate handlers versus other
>>>>> devices?  Thanks,    
>>>>
>>>> In earlier patches these functions fiddled with more objects, but at this point
>>>> they are simple enough to convert to pre_save and post_load vmstate handlers for
>>>> the container and group objects.  However, we would still need to call special 
>>>> functions for vfio from qmp_cpr_exec:
>>>>
>>>>   * validate all containers support CPR before we start blasting vaddrs
>>>>     However, I could validate all, in every call to pre_save for each container.
>>>>     That would be less efficient, but fits the vmstate model.  
>>>
>>> Would it be a better option to mirror the migration blocker support, ie.
>>> any device that doesn't support cpr registers a blocker and generic
>>> code only needs to keep track of whether any blockers are registered.  
>>
>> We cannot specifically use migrate_add_blocker(), because it is checked in
>> the migration specific function migrate_prepare(), in a layer of functions 
>> above the simpler qemu_save_device_state() used in cpr.  But yes, we could
>> do something similar for vfio.  Increment a global counter in vfio_realize
>> if the container does not support cpr, and decrement it when the container is
>> destroyed.  pre_save could just check the counter.
> 
> Right, not suggesting to piggyback on migrate_add_blocker() only to use
> a similar mechanism.  Only drivers that can't support cpr need register
> a blocker but testing for blockers is done generically, not just for
> vfio devices.
> 
>>>>   * restore all vaddr's if qemu_save_device_state fails.
>>>>     However, I could recover for all containers inside pre_save when one container fails.
>>>>     Feels strange touching all objects in a function for one, but there is no real
>>>>     downside.  
>>>
>>> I'm not as familiar as I should be with migration callbacks, thanks to
>>> mostly not supporting it for vfio devices, but it seems strange to me
>>> that there's no existing callback or notifier per device to propagate
>>> save failure.  Do we not at least get some sort of resume callback in
>>> that case?  
>>
>> We do not:
>>     struct VMStateDescription {
>>         int (*pre_load)(void *opaque);
>>         int (*post_load)(void *opaque, int version_id);
>>         int (*pre_save)(void *opaque);
>>         int (*post_save)(void *opaque);
>>
>> The handler returns an error, which stops further saves and is propagated back
>> to the top level caller qemu_save_device_state().
>>
>> The vast majority of handlers do not have side effects, with no need to unwind 
>> anything on failure.
>>
>> This raises another point.  If pre_save succeeds for all the containers,
>> but fails for some non-vfio object, then the overall operation is abandoned,
>> but we do not restore the vaddr's.  To plug that hole, we need to call the
>> unwind code from qmp_cpr_save, or implement your alternative below.
> 
> We're trying to reuse migration interfaces, are we also triggering
> migration state change notifiers?  ie.
> MIGRATION_STATUS_{CANCELLING,CANCELLED,FAILED}  

No. That happens in the migration layer which we do not use.

> We already hook vfio
> devices supporting migration into that notifier to tell the driver to
> move the device back to the running state on failure, which seems a bit
> unique to vfio devices.  Containers could maybe register their own
> callbacks.
> 
>>> As an alternative, maybe each container could register a vm change
>>> handler that would trigger reloading vaddrs if we move to a running
>>> state and a flag on the container indicates vaddrs were invalidated?
>>> Thanks,  
>>
>> That works and is modular, but I dislike that it adds checks on the
>> happy path for a case that will rarely happen, and it pushes recovery from
>> failure further away from the original failure, which would make debugging
>> cascading failures more difficult.
> 
> Would using the migration notifier move us sufficiently closer to the
> failure point?  Otherwise I think you're talking about unwinding all
> the containers when any one fails, where you didn't like that object
> overreach, or maybe adding an optional callback... but I wonder if the
> above notifier essentially already does that.
> 
> In any case, I think we have options to either implement new or use
> existing notifier-like functionality to avoid all these vfio specific
> callouts.  Thanks,

Yes, defining a cpr notifier for failure and cleanup is a good solution.
I'll work on that and a cpr blocker.  I'll use the latter for vfio and
the chardevs.
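[Editor's note: a toy version of such a cpr failure-notifier chain, for concreteness. All names are hypothetical; QEMU's NotifierList would be the natural vehicle. Containers register a cleanup callback, and the cpr-save failure path fires the chain instead of calling into vfio directly.]

```c
#include <assert.h>

#define CPR_MAX_NOTIFIERS 16

typedef void (*cpr_notify_fn)(void *opaque);

static struct { cpr_notify_fn fn; void *opaque; } cpr_notifiers[CPR_MAX_NOTIFIERS];
static int cpr_n_notifiers;

/* A container (or chardev) registers its recovery callback at realize time. */
static void cpr_add_failure_notifier(cpr_notify_fn fn, void *opaque)
{
    cpr_notifiers[cpr_n_notifiers].fn = fn;
    cpr_notifiers[cpr_n_notifiers].opaque = opaque;
    cpr_n_notifiers++;
}

/* Invoked once from the cpr-save failure path; no vfio-specific callouts. */
static void cpr_notify_failure(void)
{
    for (int i = 0; i < cpr_n_notifiers; i++) {
        cpr_notifiers[i].fn(cpr_notifiers[i].opaque);
    }
}
```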

- Steve