[Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Alexey Kardashevskiy posted 1 patch 48 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20171215062914.27927-1-aik@ozlabs.ru
Test checkpatch passed
Test docker passed
Test ppc passed
Test s390x passed
include/hw/vfio/vfio-common.h |  1 +
linux-headers/linux/vfio.h    |  5 +++++
hw/ppc/spapr.c                |  8 ++++++++
hw/vfio/common.c              | 15 +++++++++++++++
hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
5 files changed, 50 insertions(+), 2 deletions(-)

[Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 48 weeks ago
This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
which tells that a region with MSIX data can be mapped entirely, i.e.
the VFIO PCI driver won't prevent MSIX vectors area from being mapped.

With this change, all BARs are mapped in a single chunk and MSIX vectors
are emulated on top unless the machine requests not to by defining and
enabling a new "vfio-no-msix-emulation" property. At the moment only
sPAPR machine does so - it prohibits MSIX emulation and does not allow
enabling it as it does not define the "set" callback for the new property;
the new property also does not appear in "-machine pseries,help".

If the new capability is present, this puts MSIX IO memory region under
mapped memory region. If the capability is not there, it falls back to
the old behaviour with the sparse capability.

In MSIX vectors section is not aligned to the page size, the KVM memory
listener does not register it with the KVM as a memory slot and MSIX is
emulated by QEMU as before.

This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
for the new capability: https://www.spinics.net/lists/kvm/msg160282.html

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is mtree and flatview BEFORE this patch:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
      000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

"info mtree -f":
FlatView #0
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]



This is AFTER this patch applied:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
        000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
        000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


"info mtree -f":
FlatView #2
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]



This is AFTER this patch applied AND spapr_get_msix_emulation() patched
to enable emulation:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
        000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
        000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

"info mtree -f":
FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
---
 include/hw/vfio/vfio-common.h |  1 +
 linux-headers/linux/vfio.h    |  5 +++++
 hw/ppc/spapr.c                |  8 ++++++++
 hw/vfio/common.c              | 15 +++++++++++++++
 hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
 5 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f3a2ac9..927d600 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info);
+bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
 #endif
 extern const MemoryListener vfio_prereg_listener;
 
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 4312e96..b45182e 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
 
+/*
+ * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
+ */
+#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  *				    struct vfio_irq_info)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 9de63f0..693394a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
     spapr->use_hotplug_event_source = value;
 }
 
+static bool spapr_get_msix_emulation(Object *obj, Error **errp)
+{
+    return true;
+}
+
 static char *spapr_get_resize_hpt(Object *obj, Error **errp)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
@@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
     object_property_set_description(obj, "vsmt",
                                     "Virtual SMT: KVM behaves as if this were"
                                     " the host's SMT mode", &error_abort);
+    object_property_add_bool(obj, "vfio-no-msix-emulation",
+                             spapr_get_msix_emulation, NULL, NULL);
 }
 
 static void spapr_machine_finalizefn(Object *obj)
@@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
 /*
  * pseries-2.11
  */
+
 static void spapr_machine_2_11_instance_options(MachineState *machine)
 {
 }
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1fb8a8e..da5f182 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
     return -ENODEV;
 }
 
+bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
+{
+    struct vfio_region_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_region_info(vbasedev, region, &info)) {
+        if (vfio_get_region_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
+
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..27a3706 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
     off_t start, end;
     VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
 
+    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
+                            vdev->msix->table_bar)) {
+        return;
+    }
+
     /*
      * We expect to find a single mmap covering the whole BAR, anything else
      * means it's either unsupported or already setup.
@@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
     vfio_pci_fixup_msix_region(vdev);
 }
 
+static MemoryRegion *vfio_msix_parent(VFIORegion *region)
+{
+    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
+        return &region->mmaps[0].mem;
+    }
+
+    return region->mem;
+}
+
 static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
 {
     int ret;
@@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
     vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
                                     sizeof(unsigned long));
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
-                    vdev->bars[vdev->msix->table_bar].region.mem,
+                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
                     vdev->msix->table_bar, vdev->msix->table_offset,
-                    vdev->bars[vdev->msix->pba_bar].region.mem,
+                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
                     vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
                     &err);
     if (ret < 0) {
@@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
      */
     memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
 
+    if (object_property_get_bool(OBJECT(qdev_get_machine()),
+                                 "vfio-no-msix-emulation", NULL)) {
+        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
+    }
+
     return 0;
 }
 
-- 
2.11.0


Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by David Gibson 47 weeks ago
On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> which tells that a region with MSIX data can be mapped entirely, i.e.
> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> 
> With this change, all BARs are mapped in a single chunk and MSIX vectors
> are emulated on top unless the machine requests not to by defining and
> enabling a new "vfio-no-msix-emulation" property. At the moment only
> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> enabling it as it does not define the "set" callback for the new property;
> the new property also does not appear in "-machine pseries,help".
> 
> If the new capability is present, this puts MSIX IO memory region under
> mapped memory region. If the capability is not there, it falls back to
> the old behaviour with the sparse capability.
> 
> In MSIX vectors section is not aligned to the page size, the KVM memory
> listener does not register it with the KVM as a memory slot and MSIX is
> emulated by QEMU as before.
> 
> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is mtree and flatview BEFORE this patch:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> "info mtree -f":
> FlatView #0
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> 
> This is AFTER this patch applied:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> "info mtree -f":
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> 
> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> to enable emulation:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> "info mtree -f":
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> ---
>  include/hw/vfio/vfio-common.h |  1 +
>  linux-headers/linux/vfio.h    |  5 +++++
>  hw/ppc/spapr.c                |  8 ++++++++
>  hw/vfio/common.c              | 15 +++++++++++++++
>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>  5 files changed, 50 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f3a2ac9..927d600 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 4312e96..b45182e 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/*
> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> + */
> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *				    struct vfio_irq_info)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 9de63f0..693394a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>      spapr->use_hotplug_event_source = value;
>  }
>  
> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> +{
> +    return true;
> +}
> +
>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>      object_property_set_description(obj, "vsmt",
>                                      "Virtual SMT: KVM behaves as if this were"
>                                      " the host's SMT mode", &error_abort);
> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> +                             spapr_get_msix_emulation, NULL, NULL);

I prefer the approach where the property is in the PCI device, set by
the machine defaults mechanism, rather than the property being in the
machine and the device having to look it up.

If Alex prefers the latter though, I'm ok with that way around.

>  }
>  
>  static void spapr_machine_finalizefn(Object *obj)
> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>  /*
>   * pseries-2.11
>   */
> +
>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>  {
>  }
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 1fb8a8e..da5f182 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>      return -ENODEV;
>  }
>  
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> +{
> +    struct vfio_region_info *info = NULL;
> +    bool ret = false;
> +
> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> +        if (vfio_get_region_info_cap(info, cap_type)) {
> +            ret = true;
> +        }
> +        g_free(info);
> +    }
> +
> +    return ret;
> +}
> +
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c977ee3..27a3706 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      off_t start, end;
>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>  
> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> +                            vdev->msix->table_bar)) {
> +        return;
> +    }
> +
>      /*
>       * We expect to find a single mmap covering the whole BAR, anything else
>       * means it's either unsupported or already setup.
> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vfio_pci_fixup_msix_region(vdev);
>  }
>  
> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> +{
> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> +        return &region->mmaps[0].mem;
> +    }
> +
> +    return region->mem;
> +}
> +
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>  {
>      int ret;
> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>                                      sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>                      &err);
>      if (ret < 0) {
> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>       */
>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
>  
> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> +                                 "vfio-no-msix-emulation", NULL)) {
> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> +    }
> +

As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
really VFIO specific.  It would make more sense to have the test
directly in msix_init(), and have it skip adding the msix region to
the BAR if the machine has said it's not necessary.

>      return 0;
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 47 weeks ago
On Tue, 19 Dec 2017 15:52:02 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
> > This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> > which tells that a region with MSIX data can be mapped entirely, i.e.
> > the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> > 
> > With this change, all BARs are mapped in a single chunk and MSIX vectors
> > are emulated on top unless the machine requests not to by defining and
> > enabling a new "vfio-no-msix-emulation" property. At the moment only
> > sPAPR machine does so - it prohibits MSIX emulation and does not allow
> > enabling it as it does not define the "set" callback for the new property;
> > the new property also does not appear in "-machine pseries,help".
> > 
> > If the new capability is present, this puts MSIX IO memory region under
> > mapped memory region. If the capability is not there, it falls back to
> > the old behaviour with the sparse capability.
> > 
> > In MSIX vectors section is not aligned to the page size, the KVM memory
> > listener does not register it with the KVM as a memory slot and MSIX is
> > emulated by QEMU as before.
> > 
> > This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> > for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > This is mtree and flatview BEFORE this patch:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > "info mtree -f":
> > FlatView #0
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > 
> > This is AFTER this patch applied:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > "info mtree -f":
> > FlatView #2
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > 
> > This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> > to enable emulation:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > "info mtree -f":
> > FlatView #1
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > ---
> >  include/hw/vfio/vfio-common.h |  1 +
> >  linux-headers/linux/vfio.h    |  5 +++++
> >  hw/ppc/spapr.c                |  8 ++++++++
> >  hw/vfio/common.c              | 15 +++++++++++++++
> >  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >  5 files changed, 50 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index f3a2ac9..927d600 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >                           struct vfio_region_info **info);
> >  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >                               uint32_t subtype, struct vfio_region_info **info);
> > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >  #endif
> >  extern const MemoryListener vfio_prereg_listener;
> >  
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 4312e96..b45182e 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >  
> > +/*
> > + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> > + */
> > +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> > +
> >  /**
> >   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >   *				    struct vfio_irq_info)
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 9de63f0..693394a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >      spapr->use_hotplug_event_source = value;
> >  }
> >  
> > +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> > +{
> > +    return true;
> > +}
> > +
> >  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >  {
> >      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >      object_property_set_description(obj, "vsmt",
> >                                      "Virtual SMT: KVM behaves as if this were"
> >                                      " the host's SMT mode", &error_abort);
> > +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> > +                             spapr_get_msix_emulation, NULL, NULL);  
> 
> I prefer the approach where the property is in the PCI device, set by
> the machine defaults mechanism, rather than the property being in the
> machine and the device having to look it up.
> 
> If Alex prefers the latter though, I'm ok with that way around.

I prefer anything that doesn't result in a user visible option.

> >  }
> >  
> >  static void spapr_machine_finalizefn(Object *obj)
> > @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >  /*
> >   * pseries-2.11
> >   */
> > +
> >  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >  {
> >  }
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 1fb8a8e..da5f182 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >      return -ENODEV;
> >  }
> >  
> > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> > +{
> > +    struct vfio_region_info *info = NULL;
> > +    bool ret = false;
> > +
> > +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> > +        if (vfio_get_region_info_cap(info, cap_type)) {
> > +            ret = true;
> > +        }
> > +        g_free(info);
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Interfaces for IBM EEH (Enhanced Error Handling)
> >   */
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c977ee3..27a3706 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >      off_t start, end;
> >      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >  
> > +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> > +                            vdev->msix->table_bar)) {
> > +        return;
> > +    }
> > +
> >      /*
> >       * We expect to find a single mmap covering the whole BAR, anything else
> >       * means it's either unsupported or already setup.
> > @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >      vfio_pci_fixup_msix_region(vdev);
> >  }
> >  
> > +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> > +{
> > +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> > +        return &region->mmaps[0].mem;
> > +    }
> > +
> > +    return region->mem;
> > +}
> > +
> >  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >  {
> >      int ret;
> > @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
> >                                      sizeof(unsigned long));
> >      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> > -                    vdev->bars[vdev->msix->table_bar].region.mem,
> > +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
> >                      vdev->msix->table_bar, vdev->msix->table_offset,
> > -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> > +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
> >                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
> >                      &err);
> >      if (ret < 0) {
> > @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >       */
> >      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> >  
> > +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> > +                                 "vfio-no-msix-emulation", NULL)) {
> > +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> > +    }
> > +  
> 
> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
> really VFIO specific.  It would make more sense to have the test
> directly in msix_init(), and have it skip adding the msix region to
> the BAR if the machine has said it's not necessary.

How much of the MSI-X capability do SPAPR guests make use of?  Asking
msix_init() to manage an MSI-X capability that references MMIO regions
on a BAR that aren't implemented is a bit incongruous.  Is there some
way that the MSI-X capability can make it clear that MSI-X MMIO isn't
implemented?  Offsets past the end of the BAR?  Reserved BIR values?
You're basically asking msix_init() to create either an incomplete or
invalid configuration.  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 47 weeks ago
On 20/12/17 04:26, Alex Williamson wrote:
> On Tue, 19 Dec 2017 15:52:02 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>
>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>> are emulated on top unless the machine requests not to by defining and
>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>> enabling it as it does not define the "set" callback for the new property;
>>> the new property also does not appear in "-machine pseries,help".
>>>
>>> If the new capability is present, this puts MSIX IO memory region under
>>> mapped memory region. If the capability is not there, it falls back to
>>> the old behaviour with the sparse capability.
>>>
>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>> listener does not register it with the KVM as a memory slot and MSIX is
>>> emulated by QEMU as before.
>>>
>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> This is mtree and flatview BEFORE this patch:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>> "info mtree -f":
>>> FlatView #0
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>>
>>> This is AFTER this patch applied:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>> "info mtree -f":
>>> FlatView #2
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>>
>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>> to enable emulation:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>> "info mtree -f":
>>> FlatView #1
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>> ---
>>>  include/hw/vfio/vfio-common.h |  1 +
>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index f3a2ac9..927d600 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>                           struct vfio_region_info **info);
>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>                               uint32_t subtype, struct vfio_region_info **info);
>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>  #endif
>>>  extern const MemoryListener vfio_prereg_listener;
>>>  
>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>> index 4312e96..b45182e 100644
>>> --- a/linux-headers/linux/vfio.h
>>> +++ b/linux-headers/linux/vfio.h
>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>  
>>> +/*
>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>> + */
>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>> +
>>>  /**
>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>   *				    struct vfio_irq_info)
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 9de63f0..693394a 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>      spapr->use_hotplug_event_source = value;
>>>  }
>>>  
>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>> +{
>>> +    return true;
>>> +}
>>> +
>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>  {
>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>      object_property_set_description(obj, "vsmt",
>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>                                      " the host's SMT mode", &error_abort);
>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>> +                             spapr_get_msix_emulation, NULL, NULL);  
>>
>> I prefer the approach where the property is in the PCI device, set by
>> the machine defaults mechanism, rather than the property being in the
>> machine and the device having to look it up.
>>
>> If Alex prefers the latter though, I'm ok with that way around.
> 
> I prefer anything that doesn't result in a user visible option.


The only way to have this is an interface implemented by a machine. I can
do that, call it "TYPE_MSIX_MMIO" or "TYPE_CONFIG_VFIO".


>>>  }
>>>  
>>>  static void spapr_machine_finalizefn(Object *obj)
>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>  /*
>>>   * pseries-2.11
>>>   */
>>> +
>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>  {
>>>  }
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 1fb8a8e..da5f182 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>      return -ENODEV;
>>>  }
>>>  
>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>> +{
>>> +    struct vfio_region_info *info = NULL;
>>> +    bool ret = false;
>>> +
>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>> +            ret = true;
>>> +        }
>>> +        g_free(info);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>  /*
>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>   */
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index c977ee3..27a3706 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>      off_t start, end;
>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>  
>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>> +                            vdev->msix->table_bar)) {
>>> +        return;
>>> +    }
>>> +
>>>      /*
>>>       * We expect to find a single mmap covering the whole BAR, anything else
>>>       * means it's either unsupported or already setup.
>>> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>>>      vfio_pci_fixup_msix_region(vdev);
>>>  }
>>>  
>>> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
>>> +{
>>> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
>>> +        return &region->mmaps[0].mem;
>>> +    }
>>> +
>>> +    return region->mem;
>>> +}
>>> +
>>>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>>>  {
>>>      int ret;
>>> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>>>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>>>                                      sizeof(unsigned long));
>>>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
>>> -                    vdev->bars[vdev->msix->table_bar].region.mem,
>>> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
>>>                      vdev->msix->table_bar, vdev->msix->table_offset,
>>> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
>>> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
>>>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>>>                      &err);
>>>      if (ret < 0) {
>>> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>>>       */
>>>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
>>>  
>>> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
>>> +                                 "vfio-no-msix-emulation", NULL)) {
>>> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
>>> +    }
>>> +  
>>
>> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
>> really VFIO specific.  It would make more sense to have the test
>> directly in msix_init(), and have it skip adding the msix region to
>> the BAR if the machine has said it's not necessary.
> 
> How much of the MSI-X capability do SPAPR guests make use of? 

Presence of the capability seems to be just enough.

I disabled the sanity test in msix_init(), hardcoded:
vdev->msix->table_offset = 0xff0000;
vdev->msix->pba_offset = 0xfff000;
in vfio_msix_setup() and got it up and running:

00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
PCI-Express Fusion-MPT SAS-3 (rev 02)

Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at 2000c0000000 [disabled] [size=1M]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=1 offset=00ff0000
        PBA: BAR=1 offset=00fff000


> Asking
> msix_init() to manage an MSI-X capability that references MMIO regions
> on a BAR that aren't implemented is a bit incongruous.  Is there some
> way that the MSI-X capability can make it clear that MSI-X MMIO isn't
> implemented?  Offsets past the end of the BAR? 

This works.

> Reserved BIR values?

This does not - the only reserved value is 7 and it crashes as the device
does not implement BAR7 (ROM).


> You're basically asking msix_init() to create either an incomplete or
> invalid configuration.  Thanks,

I still like "vfio-no-msix-emulation" better.


-- 
Alexey

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by David Gibson 47 weeks ago
On Wed, Dec 20, 2017 at 12:13:04PM +1100, Alexey Kardashevskiy wrote:
> On 20/12/17 04:26, Alex Williamson wrote:
> > On Tue, 19 Dec 2017 15:52:02 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> >> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
> >>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >>>
> >>> With this change, all BARs are mapped in a single chunk and MSIX vectors
> >>> are emulated on top unless the machine requests not to by defining and
> >>> enabling a new "vfio-no-msix-emulation" property. At the moment only
> >>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> >>> enabling it as it does not define the "set" callback for the new property;
> >>> the new property also does not appear in "-machine pseries,help".
> >>>
> >>> If the new capability is present, this puts MSIX IO memory region under
> >>> mapped memory region. If the capability is not there, it falls back to
> >>> the old behaviour with the sparse capability.
> >>>
> >>> In MSIX vectors section is not aligned to the page size, the KVM memory
> >>> listener does not register it with the KVM as a memory slot and MSIX is
> >>> emulated by QEMU as before.
> >>>
> >>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>> ---
> >>>
> >>> This is mtree and flatview BEFORE this patch:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>> "info mtree -f":
> >>> FlatView #0
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>>
> >>> This is AFTER this patch applied:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>> "info mtree -f":
> >>> FlatView #2
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>>
> >>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>> to enable emulation:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>> "info mtree -f":
> >>> FlatView #1
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>> ---
> >>>  include/hw/vfio/vfio-common.h |  1 +
> >>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>> index f3a2ac9..927d600 100644
> >>> --- a/include/hw/vfio/vfio-common.h
> >>> +++ b/include/hw/vfio/vfio-common.h
> >>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>                           struct vfio_region_info **info);
> >>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>                               uint32_t subtype, struct vfio_region_info **info);
> >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>  #endif
> >>>  extern const MemoryListener vfio_prereg_listener;
> >>>  
> >>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>> index 4312e96..b45182e 100644
> >>> --- a/linux-headers/linux/vfio.h
> >>> +++ b/linux-headers/linux/vfio.h
> >>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>>  
> >>> +/*
> >>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>> + */
> >>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> >>> +
> >>>  /**
> >>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>   *				    struct vfio_irq_info)
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 9de63f0..693394a 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>      spapr->use_hotplug_event_source = value;
> >>>  }
> >>>  
> >>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>> +{
> >>> +    return true;
> >>> +}
> >>> +
> >>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>  {
> >>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>      object_property_set_description(obj, "vsmt",
> >>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>                                      " the host's SMT mode", &error_abort);
> >>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>> +                             spapr_get_msix_emulation, NULL, NULL);  
> >>
> >> I prefer the approach where the property is in the PCI device, set by
> >> the machine defaults mechanism, rather than the property being in the
> >> machine and the device having to look it up.
> >>
> >> If Alex prefers the latter though, I'm ok with that way around.
> > 
> > I prefer anything that doesn't result in a user visible option.
> 
> 
> The only way to have this is an interface implemented by a machine. I can
> do that, call it "TYPE_MSIX_MMIO" or "TYPE_CONFIG_VFIO".

I new interface seems overkill.  You could just add a
"needs_msix_emulation" method to MachineClass.  It will return true by
default, but pseries will override it. (or invert the sense, I don't care)

> >>>  }
> >>>  
> >>>  static void spapr_machine_finalizefn(Object *obj)
> >>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>  /*
> >>>   * pseries-2.11
> >>>   */
> >>> +
> >>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>  {
> >>>  }
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>> index 1fb8a8e..da5f182 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>      return -ENODEV;
> >>>  }
> >>>  
> >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>> +{
> >>> +    struct vfio_region_info *info = NULL;
> >>> +    bool ret = false;
> >>> +
> >>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>> +            ret = true;
> >>> +        }
> >>> +        g_free(info);
> >>> +    }
> >>> +
> >>> +    return ret;
> >>> +}
> >>> +
> >>>  /*
> >>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>   */
> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>> index c977ee3..27a3706 100644
> >>> --- a/hw/vfio/pci.c
> >>> +++ b/hw/vfio/pci.c
> >>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>      off_t start, end;
> >>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>  
> >>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>> +                            vdev->msix->table_bar)) {
> >>> +        return;
> >>> +    }
> >>> +
> >>>      /*
> >>>       * We expect to find a single mmap covering the whole BAR, anything else
> >>>       * means it's either unsupported or already setup.
> >>> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >>>      vfio_pci_fixup_msix_region(vdev);
> >>>  }
> >>>  
> >>> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> >>> +{
> >>> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> >>> +        return &region->mmaps[0].mem;
> >>> +    }
> >>> +
> >>> +    return region->mem;
> >>> +}
> >>> +
> >>>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >>>  {
> >>>      int ret;
> >>> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >>>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
> >>>                                      sizeof(unsigned long));
> >>>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> >>> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> >>> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
> >>>                      vdev->msix->table_bar, vdev->msix->table_offset,
> >>> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> >>> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
> >>>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
> >>>                      &err);
> >>>      if (ret < 0) {
> >>> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >>>       */
> >>>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> >>>  
> >>> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> >>> +                                 "vfio-no-msix-emulation", NULL)) {
> >>> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> >>> +    }
> >>> +  
> >>
> >> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
> >> really VFIO specific.  It would make more sense to have the test
> >> directly in msix_init(), and have it skip adding the msix region to
> >> the BAR if the machine has said it's not necessary.
> > 
> > How much of the MSI-X capability do SPAPR guests make use of? 
> 
> Presence of the capability seems to be just enough.
> 
> I disabled the sanity test in msix_init(), hardcoded:
> vdev->msix->table_offset = 0xff0000;
> vdev->msix->pba_offset = 0xfff000;
> in vfio_msix_setup() and got it up and running:
> 
> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> PCI-Express Fusion-MPT SAS-3 (rev 02)
> 
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> Expansion ROM at 2000c0000000 [disabled] [size=1M]
> 
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>         Vector table: BAR=1 offset=00ff0000
>         PBA: BAR=1 offset=00fff000
> 
> 
> > Asking
> > msix_init() to manage an MSI-X capability that references MMIO regions
> > on a BAR that aren't implemented is a bit incongruous.  Is there some
> > way that the MSI-X capability can make it clear that MSI-X MMIO isn't
> > implemented?  Offsets past the end of the BAR? 
> 
> This works.

Sounds ok.

> > Reserved BIR values?
> 
> This does not - the only reserved value is 7 and it crashes as the device
> does not implement BAR7 (ROM).
> 
> 
> > You're basically asking msix_init() to create either an incomplete or
> > invalid configuration.  Thanks,

I guess.  I mean this is a platform specific performance hack, I feel
as though hackiness is to be expected.

> I still like "vfio-no-msix-emulation" better.
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Auger Eric 45 weeks ago
Hi Alexey,

On 15/12/17 07:29, Alexey Kardashevskiy wrote:
> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> which tells that a region with MSIX data can be mapped entirely, i.e.
> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> 
> With this change, all BARs are mapped in a single chunk and MSIX vectors
> are emulated on top unless the machine requests not to by defining and
> enabling a new "vfio-no-msix-emulation" property. At the moment only
> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> enabling it as it does not define the "set" callback for the new property;
> the new property also does not appear in "-machine pseries,help".
> 
> If the new capability is present, this puts MSIX IO memory region under
> mapped memory region. If the capability is not there, it falls back to
> the old behaviour with the sparse capability.
> 
> In MSIX vectors section is not aligned to the page size, the KVM memory
> listener does not register it with the KVM as a memory slot and MSIX is
> emulated by QEMU as before.
> 
> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is mtree and flatview BEFORE this patch:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> "info mtree -f":
> FlatView #0
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> 
> This is AFTER this patch applied:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> "info mtree -f":
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> 
> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> to enable emulation:
> 
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> "info mtree -f":
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> ---
>  include/hw/vfio/vfio-common.h |  1 +
>  linux-headers/linux/vfio.h    |  5 +++++
>  hw/ppc/spapr.c                |  8 ++++++++
>  hw/vfio/common.c              | 15 +++++++++++++++
>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>  5 files changed, 50 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f3a2ac9..927d600 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>  
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 4312e96..b45182e 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>  
> +/*
> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> + */
> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *				    struct vfio_irq_info)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 9de63f0..693394a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>      spapr->use_hotplug_event_source = value;
>  }
>  
> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> +{
> +    return true;
> +}
> +
>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>      object_property_set_description(obj, "vsmt",
>                                      "Virtual SMT: KVM behaves as if this were"
>                                      " the host's SMT mode", &error_abort);
> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> +                             spapr_get_msix_emulation, NULL, NULL);
>  }
>  
>  static void spapr_machine_finalizefn(Object *obj)
> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>  /*
>   * pseries-2.11
>   */
> +
>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>  {
>  }
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 1fb8a8e..da5f182 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>      return -ENODEV;
>  }
>  
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> +{
> +    struct vfio_region_info *info = NULL;
> +    bool ret = false;
> +
> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> +        if (vfio_get_region_info_cap(info, cap_type)) {
> +            ret = true;
> +        }
> +        g_free(info);
> +    }
> +
> +    return ret;
> +}
> +
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c977ee3..27a3706 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      off_t start, end;
>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>  
> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> +                            vdev->msix->table_bar)) {
> +        return;
> +    }

I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
msix table at 0x2000, using 64kB pages.

0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
mmaps[0]
  0000000000002000-00000000000023ff (prio 0, i/o): msix-table
  0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]


My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
vfio_listener_region_add_ram gets called on the first 8kB which cannot
be mapped with 64kB page, leading to a qemu abort.

29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
0x8000000000 - 0x8000001fff [0xffffa8820000]
qemu-system-aarch64: VFIO_MAP_DMA: -22
qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
0xffffa8820000) = -22 (Invalid argument)
qemu: hardware error: vfio: DMA mapping failed, unable to continue

In my case don't we still need the fixup logic?

Thanks

Eric

> +
>      /*
>       * We expect to find a single mmap covering the whole BAR, anything else
>       * means it's either unsupported or already setup.
> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vfio_pci_fixup_msix_region(vdev);
>  }
>  
> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> +{
> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> +        return &region->mmaps[0].mem;
> +    }
> +
> +    return region->mem;
> +}
> +
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>  {
>      int ret;
> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>                                      sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>                      &err);
>      if (ret < 0) {
> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>       */
>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
>  
> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> +                                 "vfio-no-msix-emulation", NULL)) {
> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> +    }
> +
>      return 0;
>  }
>  
> 

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 45 weeks ago
On Fri, 5 Jan 2018 10:48:07 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alexey,
> 
> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
> > This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> > which tells that a region with MSIX data can be mapped entirely, i.e.
> > the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> > 
> > With this change, all BARs are mapped in a single chunk and MSIX vectors
> > are emulated on top unless the machine requests not to by defining and
> > enabling a new "vfio-no-msix-emulation" property. At the moment only
> > sPAPR machine does so - it prohibits MSIX emulation and does not allow
> > enabling it as it does not define the "set" callback for the new property;
> > the new property also does not appear in "-machine pseries,help".
> > 
> > If the new capability is present, this puts MSIX IO memory region under
> > mapped memory region. If the capability is not there, it falls back to
> > the old behaviour with the sparse capability.
> > 
> > In MSIX vectors section is not aligned to the page size, the KVM memory
> > listener does not register it with the KVM as a memory slot and MSIX is
> > emulated by QEMU as before.
> > 
> > This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> > for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > 
> > This is mtree and flatview BEFORE this patch:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > "info mtree -f":
> > FlatView #0
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > 
> > This is AFTER this patch applied:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > "info mtree -f":
> > FlatView #2
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > 
> > 
> > This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> > to enable emulation:
> > 
> > "info mtree":
> > memory-region: pci@800000020000000.mmio
> >   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > 
> > "info mtree -f":
> > FlatView #1
> >  AS "memory", root: system
> >  AS "cpu-memory", root: system
> >  Root memory region: system
> >   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> > ---
> >  include/hw/vfio/vfio-common.h |  1 +
> >  linux-headers/linux/vfio.h    |  5 +++++
> >  hw/ppc/spapr.c                |  8 ++++++++
> >  hw/vfio/common.c              | 15 +++++++++++++++
> >  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >  5 files changed, 50 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > index f3a2ac9..927d600 100644
> > --- a/include/hw/vfio/vfio-common.h
> > +++ b/include/hw/vfio/vfio-common.h
> > @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >                           struct vfio_region_info **info);
> >  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >                               uint32_t subtype, struct vfio_region_info **info);
> > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >  #endif
> >  extern const MemoryListener vfio_prereg_listener;
> >  
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 4312e96..b45182e 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >  
> > +/*
> > + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> > + */
> > +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> > +
> >  /**
> >   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >   *				    struct vfio_irq_info)
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 9de63f0..693394a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >      spapr->use_hotplug_event_source = value;
> >  }
> >  
> > +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> > +{
> > +    return true;
> > +}
> > +
> >  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >  {
> >      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >      object_property_set_description(obj, "vsmt",
> >                                      "Virtual SMT: KVM behaves as if this were"
> >                                      " the host's SMT mode", &error_abort);
> > +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> > +                             spapr_get_msix_emulation, NULL, NULL);
> >  }
> >  
> >  static void spapr_machine_finalizefn(Object *obj)
> > @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >  /*
> >   * pseries-2.11
> >   */
> > +
> >  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >  {
> >  }
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 1fb8a8e..da5f182 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >      return -ENODEV;
> >  }
> >  
> > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> > +{
> > +    struct vfio_region_info *info = NULL;
> > +    bool ret = false;
> > +
> > +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> > +        if (vfio_get_region_info_cap(info, cap_type)) {
> > +            ret = true;
> > +        }
> > +        g_free(info);
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Interfaces for IBM EEH (Enhanced Error Handling)
> >   */
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c977ee3..27a3706 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >      off_t start, end;
> >      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >  
> > +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> > +                            vdev->msix->table_bar)) {
> > +        return;
> > +    }  
> 
> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> msix table at 0x2000, using 64kB pages.
> 
> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
> mmaps[0]
>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> 
> 
> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> be mapped with 64kB page, leading to a qemu abort.
> 
> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> qemu-system-aarch64: VFIO_MAP_DMA: -22
> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> 0xffffa8820000) = -22 (Invalid argument)
> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> 
> In my case don't we still need the fixup logic?

Ah, I didn't realize you had Alexey's QEMU patch when you asked about
this.  I'd say no, the fixup shouldn't be needed, but apparently we
can't remove it until we figure out this bug and who should be filtering
out sub-page size listener events.  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 43 weeks ago
On 06/01/18 02:29, Alex Williamson wrote:
> On Fri, 5 Jan 2018 10:48:07 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alexey,
>>
>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>
>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>> are emulated on top unless the machine requests not to by defining and
>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>> enabling it as it does not define the "set" callback for the new property;
>>> the new property also does not appear in "-machine pseries,help".
>>>
>>> If the new capability is present, this puts MSIX IO memory region under
>>> mapped memory region. If the capability is not there, it falls back to
>>> the old behaviour with the sparse capability.
>>>
>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>> listener does not register it with the KVM as a memory slot and MSIX is
>>> emulated by QEMU as before.
>>>
>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> This is mtree and flatview BEFORE this patch:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>> "info mtree -f":
>>> FlatView #0
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>>
>>> This is AFTER this patch applied:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>> "info mtree -f":
>>> FlatView #2
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>>
>>>
>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>> to enable emulation:
>>>
>>> "info mtree":
>>> memory-region: pci@800000020000000.mmio
>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>
>>> "info mtree -f":
>>> FlatView #1
>>>  AS "memory", root: system
>>>  AS "cpu-memory", root: system
>>>  Root memory region: system
>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>> ---
>>>  include/hw/vfio/vfio-common.h |  1 +
>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index f3a2ac9..927d600 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>                           struct vfio_region_info **info);
>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>                               uint32_t subtype, struct vfio_region_info **info);
>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>  #endif
>>>  extern const MemoryListener vfio_prereg_listener;
>>>  
>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>> index 4312e96..b45182e 100644
>>> --- a/linux-headers/linux/vfio.h
>>> +++ b/linux-headers/linux/vfio.h
>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>  
>>> +/*
>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>> + */
>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>> +
>>>  /**
>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>   *				    struct vfio_irq_info)
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 9de63f0..693394a 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>      spapr->use_hotplug_event_source = value;
>>>  }
>>>  
>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>> +{
>>> +    return true;
>>> +}
>>> +
>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>  {
>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>      object_property_set_description(obj, "vsmt",
>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>                                      " the host's SMT mode", &error_abort);
>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>  }
>>>  
>>>  static void spapr_machine_finalizefn(Object *obj)
>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>  /*
>>>   * pseries-2.11
>>>   */
>>> +
>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>  {
>>>  }
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 1fb8a8e..da5f182 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>      return -ENODEV;
>>>  }
>>>  
>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>> +{
>>> +    struct vfio_region_info *info = NULL;
>>> +    bool ret = false;
>>> +
>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>> +            ret = true;
>>> +        }
>>> +        g_free(info);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>  /*
>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>   */
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index c977ee3..27a3706 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>      off_t start, end;
>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>  
>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>> +                            vdev->msix->table_bar)) {
>>> +        return;
>>> +    }  
>>
>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>> msix table at 0x2000, using 64kB pages.
>>
>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>> mmaps[0]
>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>
>>
>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>> be mapped with 64kB page, leading to a qemu abort.
>>
>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>> 0xffffa8820000) = -22 (Invalid argument)
>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>
>> In my case don't we still need the fixup logic?
> 
> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> this.  I'd say no, the fixup shouldn't be needed, but apparently we
> can't remove it until we figure out this bug and who should be filtering
> out sub-page size listener events.  Thanks,


This should do it:

@@ -303,11 +303,19 @@ static bool
vfio_listener_skipped_section(MemoryRegionSection *section)
             * Sizing an enabled 64-bit BAR can cause spurious mappings to
             * addresses in the upper part of the 64-bit address space.  These
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           /*
+            * Non-system-page aligned RAM regions have no chance to be
+            * DMA mapped so skip them too.
+            */
+           (memory_region_is_ram(section->mr) &&
+            ((section->offset_within_region & qemu_real_host_page_mask) ||
+             (int128_getlo(section->size) & qemu_real_host_page_mask))
+           );
 }


Do we VFIO DMA map on every RAM MR just because it is not necessary to have
more complicated logic to detect whether it is MMIO? Or there is something
bigger?

As for now, it is possible to have non-aligned RAM MR, the KVM's listener
just does not register it with the KVM if unaligned.



-- 
Alexey

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Auger Eric 43 weeks ago
Hi Alexey,

On 16/01/18 06:17, Alexey Kardashevskiy wrote:
> On 06/01/18 02:29, Alex Williamson wrote:
>> On Fri, 5 Jan 2018 10:48:07 +0100
>> Auger Eric <eric.auger@redhat.com> wrote:
>>
>>> Hi Alexey,
>>>
>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>
>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>> are emulated on top unless the machine requests not to by defining and
>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>> enabling it as it does not define the "set" callback for the new property;
>>>> the new property also does not appear in "-machine pseries,help".
>>>>
>>>> If the new capability is present, this puts MSIX IO memory region under
>>>> mapped memory region. If the capability is not there, it falls back to
>>>> the old behaviour with the sparse capability.
>>>>
>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>> emulated by QEMU as before.
>>>>
>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>
>>>> This is mtree and flatview BEFORE this patch:
>>>>
>>>> "info mtree":
>>>> memory-region: pci@800000020000000.mmio
>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>> "info mtree -f":
>>>> FlatView #0
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>>
>>>>
>>>> This is AFTER this patch applied:
>>>>
>>>> "info mtree":
>>>> memory-region: pci@800000020000000.mmio
>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>>
>>>> "info mtree -f":
>>>> FlatView #2
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>>
>>>>
>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>> to enable emulation:
>>>>
>>>> "info mtree":
>>>> memory-region: pci@800000020000000.mmio
>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>> "info mtree -f":
>>>> FlatView #1
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>> ---
>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index f3a2ac9..927d600 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>                           struct vfio_region_info **info);
>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>  #endif
>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>  
>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>> index 4312e96..b45182e 100644
>>>> --- a/linux-headers/linux/vfio.h
>>>> +++ b/linux-headers/linux/vfio.h
>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>  
>>>> +/*
>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>> + */
>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>> +
>>>>  /**
>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>   *				    struct vfio_irq_info)
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index 9de63f0..693394a 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>      spapr->use_hotplug_event_source = value;
>>>>  }
>>>>  
>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>> +{
>>>> +    return true;
>>>> +}
>>>> +
>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>  {
>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>      object_property_set_description(obj, "vsmt",
>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>                                      " the host's SMT mode", &error_abort);
>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>  }
>>>>  
>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>  /*
>>>>   * pseries-2.11
>>>>   */
>>>> +
>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>  {
>>>>  }
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 1fb8a8e..da5f182 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>      return -ENODEV;
>>>>  }
>>>>  
>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>> +{
>>>> +    struct vfio_region_info *info = NULL;
>>>> +    bool ret = false;
>>>> +
>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>> +            ret = true;
>>>> +        }
>>>> +        g_free(info);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>   */
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index c977ee3..27a3706 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>      off_t start, end;
>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>  
>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>> +                            vdev->msix->table_bar)) {
>>>> +        return;
>>>> +    }  
>>>
>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>> msix table at 0x2000, using 64kB pages.
>>>
>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>> mmaps[0]
>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>
>>>
>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>> be mapped with 64kB page, leading to a qemu abort.
>>>
>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>> 0xffffa8820000) = -22 (Invalid argument)
>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>
>>> In my case don't we still need the fixup logic?
>>
>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>> can't remove it until we figure out this bug and who should be filtering
>> out sub-page size listener events.  Thanks,
> 
> 
> This should do it:
> 
> @@ -303,11 +303,19 @@ static bool
> vfio_listener_skipped_section(MemoryRegionSection *section)
>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>              * addresses in the upper part of the 64-bit address space.  These
>              * are never accessed by the CPU and beyond the address width of
>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>              */
> -           section->offset_within_address_space & (1ULL << 63);
> +           section->offset_within_address_space & (1ULL << 63) ||
> +           /*
> +            * Non-system-page aligned RAM regions have no chance to be
> +            * DMA mapped so skip them too.
> +            */
> +           (memory_region_is_ram(section->mr) &&
> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> +           );
>  }
This should indeed avoid the vfio_dma_map error. I can't test it at the
moment since I don't have access to this HW anymore but as soon as I
recover it, I will let you know.

Thanks

Eric
> 
> 
> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> more complicated logic to detect whether it is MMIO? Or there is something
> bigger?
> 
> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
> just does not register it with the KVM if unaligned.
> 
> 
> 

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 43 weeks ago
On Tue, 16 Jan 2018 16:17:58 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 06/01/18 02:29, Alex Williamson wrote:
> > On Fri, 5 Jan 2018 10:48:07 +0100
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Alexey,
> >>
> >> On 15/12/17 07:29, Alexey Kardashevskiy wrote:  
> >>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >>>
> >>> With this change, all BARs are mapped in a single chunk and MSIX vectors
> >>> are emulated on top unless the machine requests not to by defining and
> >>> enabling a new "vfio-no-msix-emulation" property. At the moment only
> >>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> >>> enabling it as it does not define the "set" callback for the new property;
> >>> the new property also does not appear in "-machine pseries,help".
> >>>
> >>> If the new capability is present, this puts MSIX IO memory region under
> >>> mapped memory region. If the capability is not there, it falls back to
> >>> the old behaviour with the sparse capability.
> >>>
> >>> In MSIX vectors section is not aligned to the page size, the KVM memory
> >>> listener does not register it with the KVM as a memory slot and MSIX is
> >>> emulated by QEMU as before.
> >>>
> >>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>
> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>> ---
> >>>
> >>> This is mtree and flatview BEFORE this patch:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>> "info mtree -f":
> >>> FlatView #0
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>>
> >>> This is AFTER this patch applied:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>> "info mtree -f":
> >>> FlatView #2
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>>
> >>>
> >>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>> to enable emulation:
> >>>
> >>> "info mtree":
> >>> memory-region: pci@800000020000000.mmio
> >>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>
> >>> "info mtree -f":
> >>> FlatView #1
> >>>  AS "memory", root: system
> >>>  AS "cpu-memory", root: system
> >>>  Root memory region: system
> >>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>> ---
> >>>  include/hw/vfio/vfio-common.h |  1 +
> >>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>> index f3a2ac9..927d600 100644
> >>> --- a/include/hw/vfio/vfio-common.h
> >>> +++ b/include/hw/vfio/vfio-common.h
> >>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>                           struct vfio_region_info **info);
> >>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>                               uint32_t subtype, struct vfio_region_info **info);
> >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>  #endif
> >>>  extern const MemoryListener vfio_prereg_listener;
> >>>  
> >>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>> index 4312e96..b45182e 100644
> >>> --- a/linux-headers/linux/vfio.h
> >>> +++ b/linux-headers/linux/vfio.h
> >>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>>  
> >>> +/*
> >>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>> + */
> >>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> >>> +
> >>>  /**
> >>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>   *				    struct vfio_irq_info)
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 9de63f0..693394a 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>      spapr->use_hotplug_event_source = value;
> >>>  }
> >>>  
> >>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>> +{
> >>> +    return true;
> >>> +}
> >>> +
> >>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>  {
> >>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>      object_property_set_description(obj, "vsmt",
> >>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>                                      " the host's SMT mode", &error_abort);
> >>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>  }
> >>>  
> >>>  static void spapr_machine_finalizefn(Object *obj)
> >>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>  /*
> >>>   * pseries-2.11
> >>>   */
> >>> +
> >>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>  {
> >>>  }
> >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>> index 1fb8a8e..da5f182 100644
> >>> --- a/hw/vfio/common.c
> >>> +++ b/hw/vfio/common.c
> >>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>      return -ENODEV;
> >>>  }
> >>>  
> >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>> +{
> >>> +    struct vfio_region_info *info = NULL;
> >>> +    bool ret = false;
> >>> +
> >>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>> +            ret = true;
> >>> +        }
> >>> +        g_free(info);
> >>> +    }
> >>> +
> >>> +    return ret;
> >>> +}
> >>> +
> >>>  /*
> >>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>   */
> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>> index c977ee3..27a3706 100644
> >>> --- a/hw/vfio/pci.c
> >>> +++ b/hw/vfio/pci.c
> >>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>      off_t start, end;
> >>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>  
> >>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>> +                            vdev->msix->table_bar)) {
> >>> +        return;
> >>> +    }    
> >>
> >> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> >> msix table at 0x2000, using 64kB pages.
> >>
> >> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
> >> mmaps[0]
> >>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>
> >>
> >> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> >> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> >> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> >> be mapped with 64kB page, leading to a qemu abort.
> >>
> >> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> >> 0xffffa8820000) = -22 (Invalid argument)
> >> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>
> >> In my case don't we still need the fixup logic?  
> > 
> > Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> > this.  I'd say no, the fixup shouldn't be needed, but apparently we
> > can't remove it until we figure out this bug and who should be filtering
> > out sub-page size listener events.  Thanks,  
> 
> 
> This should do it:
> 
> @@ -303,11 +303,19 @@ static bool
> vfio_listener_skipped_section(MemoryRegionSection *section)
>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>              * addresses in the upper part of the 64-bit address space.  These
>              * are never accessed by the CPU and beyond the address width of
>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>              */
> -           section->offset_within_address_space & (1ULL << 63);
> +           section->offset_within_address_space & (1ULL << 63) ||
> +           /*
> +            * Non-system-page aligned RAM regions have no chance to be
> +            * DMA mapped so skip them too.
> +            */
> +           (memory_region_is_ram(section->mr) &&
> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> +           );
>  }
> 
> 
> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> more complicated logic to detect whether it is MMIO? Or there is something
> bigger?

Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
silly address hack above to try to exclude PCI BARs as they're being
resized, and b) there is some value to enabling peer-to-peer mappings
between devices.  Is there a better, more generic solution to a) now?
If we know a MR is MMIO, then I think we can handle any associated
mapping failure as non-fatal.  It's nice to enable p2p, but it's barely
used and not clear that it works reliably (sometimes even on bare
metal).  Nit, we don't yet have the IOVA range of the IOMMU exposed,
thus the silly address hack, but we do have supported page sizes, so we
should probably be referencing that rather than host page size.
 
> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
> just does not register it with the KVM if unaligned.

For a sanity check, we don't support running a guest with a smaller
page size than the host, right?  I assume that would be the only case
where we would have a reasonable chance of the guest using a host
sub-page as a DMA target, but that seems like just one of an infinite
number of problems with such a setup.  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 43 weeks ago
On 19/01/18 08:59, Alex Williamson wrote:
> On Tue, 16 Jan 2018 16:17:58 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 06/01/18 02:29, Alex Williamson wrote:
>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>   
>>>> Hi Alexey,
>>>>
>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:  
>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>
>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>> are emulated on top unless the machine requests not to by defining and
>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>
>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>> the old behaviour with the sparse capability.
>>>>>
>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>> emulated by QEMU as before.
>>>>>
>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> This is mtree and flatview BEFORE this patch:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #0
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>>
>>>>> This is AFTER this patch applied:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #2
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>>
>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>> to enable emulation:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #1
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>> ---
>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>> index f3a2ac9..927d600 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>                           struct vfio_region_info **info);
>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>  #endif
>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>  
>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>> index 4312e96..b45182e 100644
>>>>> --- a/linux-headers/linux/vfio.h
>>>>> +++ b/linux-headers/linux/vfio.h
>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>  
>>>>> +/*
>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>> + */
>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>> +
>>>>>  /**
>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>   *				    struct vfio_irq_info)
>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>> index 9de63f0..693394a 100644
>>>>> --- a/hw/ppc/spapr.c
>>>>> +++ b/hw/ppc/spapr.c
>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>      spapr->use_hotplug_event_source = value;
>>>>>  }
>>>>>  
>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>> +{
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>  {
>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>      object_property_set_description(obj, "vsmt",
>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>                                      " the host's SMT mode", &error_abort);
>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>  }
>>>>>  
>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>  /*
>>>>>   * pseries-2.11
>>>>>   */
>>>>> +
>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>  {
>>>>>  }
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 1fb8a8e..da5f182 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>      return -ENODEV;
>>>>>  }
>>>>>  
>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>> +{
>>>>> +    struct vfio_region_info *info = NULL;
>>>>> +    bool ret = false;
>>>>> +
>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>> +            ret = true;
>>>>> +        }
>>>>> +        g_free(info);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>   */
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index c977ee3..27a3706 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>      off_t start, end;
>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>  
>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>> +                            vdev->msix->table_bar)) {
>>>>> +        return;
>>>>> +    }    
>>>>
>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>> msix table at 0x2000, using 64kB pages.
>>>>
>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>> mmaps[0]
>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>
>>>>
>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>
>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>
>>>> In my case don't we still need the fixup logic?  
>>>
>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>> can't remove it until we figure out this bug and who should be filtering
>>> out sub-page size listener events.  Thanks,  
>>
>>
>> This should do it:
>>
>> @@ -303,11 +303,19 @@ static bool
>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>              * addresses in the upper part of the 64-bit address space.  These
>>              * are never accessed by the CPU and beyond the address width of
>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>              */
>> -           section->offset_within_address_space & (1ULL << 63);
>> +           section->offset_within_address_space & (1ULL << 63) ||
>> +           /*
>> +            * Non-system-page aligned RAM regions have no chance to be
>> +            * DMA mapped so skip them too.
>> +            */
>> +           (memory_region_is_ram(section->mr) &&
>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>> +           );
>>  }
>>
>>
>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>> more complicated logic to detect whether it is MMIO? Or there is something
>> bigger?
> 
> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> silly address hack above to try to exclude PCI BARs as they're being
> resized, and b) there is some value to enabling peer-to-peer mappings
> between devices.  Is there a better, more generic solution to a) now?

"now?" what? :)

> If we know a MR is MMIO, then I think we can handle any associated
> mapping failure as non-fatal. 


mr->ram_device sound like an appropriate flag to check.


> It's nice to enable p2p, but it's barely
> used and not clear that it works reliably (sometimes even on bare
> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
> thus the silly address hack, but we do have supported page sizes, so we
> should probably be referencing that rather than host page size.

Ah, correct,
s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/

>  
>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>> just does not register it with the KVM if unaligned.
> 
> For a sanity check, we don't support running a guest with a smaller
> page size than the host, right?

We do, at least on POWER.

> I assume that would be the only case
> where we would have a reasonable chance of the guest using a host
> sub-page as a DMA target, but that seems like just one of an infinite
> number of problems with such a setup.  Thanks,

Like what? I am missing the point here. The most popular config on POWER
today is 64k system page size in both host and guest and 4K IOMMU page size
for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.



-- 
Alexey

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 43 weeks ago
On Fri, 19 Jan 2018 13:41:41 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/01/18 08:59, Alex Williamson wrote:
> > On Tue, 16 Jan 2018 16:17:58 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 06/01/18 02:29, Alex Williamson wrote:  
> >>> On Fri, 5 Jan 2018 10:48:07 +0100
> >>> Auger Eric <eric.auger@redhat.com> wrote:
> >>>     
> >>>> Hi Alexey,
> >>>>
> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
> >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >>>>>
> >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
> >>>>> are emulated on top unless the machine requests not to by defining and
> >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
> >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> >>>>> enabling it as it does not define the "set" callback for the new property;
> >>>>> the new property also does not appear in "-machine pseries,help".
> >>>>>
> >>>>> If the new capability is present, this puts MSIX IO memory region under
> >>>>> mapped memory region. If the capability is not there, it falls back to
> >>>>> the old behaviour with the sparse capability.
> >>>>>
> >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
> >>>>> listener does not register it with the KVM as a memory slot and MSIX is
> >>>>> emulated by QEMU as before.
> >>>>>
> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>>>
> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>> ---
> >>>>>
> >>>>> This is mtree and flatview BEFORE this patch:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #0
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #2
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>>>> to enable emulation:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #1
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> ---
> >>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>>> index f3a2ac9..927d600 100644
> >>>>> --- a/include/hw/vfio/vfio-common.h
> >>>>> +++ b/include/hw/vfio/vfio-common.h
> >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>>>                           struct vfio_region_info **info);
> >>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>                               uint32_t subtype, struct vfio_region_info **info);
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>>>  #endif
> >>>>>  extern const MemoryListener vfio_prereg_listener;
> >>>>>  
> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>> index 4312e96..b45182e 100644
> >>>>> --- a/linux-headers/linux/vfio.h
> >>>>> +++ b/linux-headers/linux/vfio.h
> >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>>>>  
> >>>>> +/*
> >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>>>> + */
> >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> >>>>> +
> >>>>>  /**
> >>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>>>   *				    struct vfio_irq_info)
> >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>> index 9de63f0..693394a 100644
> >>>>> --- a/hw/ppc/spapr.c
> >>>>> +++ b/hw/ppc/spapr.c
> >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>>>      spapr->use_hotplug_event_source = value;
> >>>>>  }
> >>>>>  
> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>>>> +{
> >>>>> +    return true;
> >>>>> +}
> >>>>> +
> >>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>>>  {
> >>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>>>      object_property_set_description(obj, "vsmt",
> >>>>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>>>                                      " the host's SMT mode", &error_abort);
> >>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>>>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>>>  }
> >>>>>  
> >>>>>  static void spapr_machine_finalizefn(Object *obj)
> >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>>>  /*
> >>>>>   * pseries-2.11
> >>>>>   */
> >>>>> +
> >>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>> index 1fb8a8e..da5f182 100644
> >>>>> --- a/hw/vfio/common.c
> >>>>> +++ b/hw/vfio/common.c
> >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>      return -ENODEV;
> >>>>>  }
> >>>>>  
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>>>> +{
> >>>>> +    struct vfio_region_info *info = NULL;
> >>>>> +    bool ret = false;
> >>>>> +
> >>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>>>> +            ret = true;
> >>>>> +        }
> >>>>> +        g_free(info);
> >>>>> +    }
> >>>>> +
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>>>   */
> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>> index c977ee3..27a3706 100644
> >>>>> --- a/hw/vfio/pci.c
> >>>>> +++ b/hw/vfio/pci.c
> >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>>>      off_t start, end;
> >>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>>>  
> >>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>>>> +                            vdev->msix->table_bar)) {
> >>>>> +        return;
> >>>>> +    }      
> >>>>
> >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> >>>> msix table at 0x2000, using 64kB pages.
> >>>>
> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
> >>>> mmaps[0]
> >>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>>>
> >>>>
> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> >>>> be mapped with 64kB page, leading to a qemu abort.
> >>>>
> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> >>>> 0xffffa8820000) = -22 (Invalid argument)
> >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>>>
> >>>> In my case don't we still need the fixup logic?    
> >>>
> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
> >>> can't remove it until we figure out this bug and who should be filtering
> >>> out sub-page size listener events.  Thanks,    
> >>
> >>
> >> This should do it:
> >>
> >> @@ -303,11 +303,19 @@ static bool
> >> vfio_listener_skipped_section(MemoryRegionSection *section)
> >>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>              * addresses in the upper part of the 64-bit address space.  These
> >>              * are never accessed by the CPU and beyond the address width of
> >>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
> >>              */
> >> -           section->offset_within_address_space & (1ULL << 63);
> >> +           section->offset_within_address_space & (1ULL << 63) ||
> >> +           /*
> >> +            * Non-system-page aligned RAM regions have no chance to be
> >> +            * DMA mapped so skip them too.
> >> +            */
> >> +           (memory_region_is_ram(section->mr) &&
> >> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> >> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> >> +           );
> >>  }
> >>
> >>
> >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> >> more complicated logic to detect whether it is MMIO? Or there is something
> >> bigger?  
> > 
> > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> > silly address hack above to try to exclude PCI BARs as they're being
> > resized, and b) there is some value to enabling peer-to-peer mappings
> > between devices.  Is there a better, more generic solution to a) now?  
> 
> "now?" what? :)

Now - adverb, at the present time or moment.  For example, vs in the
past.

> > If we know a MR is MMIO, then I think we can handle any associated
> > mapping failure as non-fatal.   
> 
> 
> mr->ram_device sound like an appropriate flag to check.

Yeah, I suppose we could, and that didn't exist when we added the
bit 63 hack.

> > It's nice to enable p2p, but it's barely
> > used and not clear that it works reliably (sometimes even on bare
> > metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
> > thus the silly address hack, but we do have supported page sizes, so we
> > should probably be referencing that rather than host page size.  
> 
> Ah, correct,
> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/

No, that would be the emulated iommu page size, which may or may not
(hopefully does) match the physical iommu page size.  The vfio iommu
page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.

> >> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
> >> just does not register it with the KVM if unaligned.  
> > 
> > For a sanity check, we don't support running a guest with a smaller
> > page size than the host, right?  
> 
> We do, at least on POWER.
> 
> > I assume that would be the only case
> > where we would have a reasonable chance of the guest using a host
> > sub-page as a DMA target, but that seems like just one of an infinite
> > number of problems with such a setup.  Thanks,  
> 
> Like what? I am missing the point here. The most popular config on POWER
> today is 64k system page size in both host and guest and 4K IOMMU page size
> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.

The IOMMU supporting a smaller page size than the processor is the
thing I think I was missing here, but it makes sense that they could,
they're not necessarily linked.  Eric, will SMMU support 4K even while
the CPU runs at 64K?  The type1 backend has a number of PAGE_SIZE
assumptions in it and will not expose IOMMU page sizes less than
PAGE_SIZE.  Perhaps this is really the root of the problem, but it is
advertising vfio_iommu_type1_info.iova_pgsizes, which userspace should
honor.  So I think running (target page size) < (host page size) is
probably broken for type1, for instance the guest could have a 4K RAM
page as a DMA target that the 64K host couldn't map due to alignment,
therefore DMA with that page would fault.  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 43 weeks ago
On 19/01/18 14:25, Alex Williamson wrote:
> On Fri, 19 Jan 2018 13:41:41 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/01/18 08:59, Alex Williamson wrote:
>>> On Tue, 16 Jan 2018 16:17:58 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 06/01/18 02:29, Alex Williamson wrote:  
>>>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>>>     
>>>>>> Hi Alexey,
>>>>>>
>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>>>
>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>>>> are emulated on top unless the machine requests not to by defining and
>>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>>>
>>>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>>>> the old behaviour with the sparse capability.
>>>>>>>
>>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>>>> emulated by QEMU as before.
>>>>>>>
>>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> This is mtree and flatview BEFORE this patch:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #0
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #2
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>>>> to enable emulation:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #1
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>> ---
>>>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>>>> index f3a2ac9..927d600 100644
>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>>>                           struct vfio_region_info **info);
>>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>>>  #endif
>>>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>>>  
>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>> index 4312e96..b45182e 100644
>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>>>  
>>>>>>> +/*
>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>>>> + */
>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>>>   *				    struct vfio_irq_info)
>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>>> index 9de63f0..693394a 100644
>>>>>>> --- a/hw/ppc/spapr.c
>>>>>>> +++ b/hw/ppc/spapr.c
>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>>>      spapr->use_hotplug_event_source = value;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>>>> +{
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>>>  {
>>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>>>      object_property_set_description(obj, "vsmt",
>>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>>>                                      " the host's SMT mode", &error_abort);
>>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>>>  }
>>>>>>>  
>>>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>>>  /*
>>>>>>>   * pseries-2.11
>>>>>>>   */
>>>>>>> +
>>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>>>  {
>>>>>>>  }
>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>> index 1fb8a8e..da5f182 100644
>>>>>>> --- a/hw/vfio/common.c
>>>>>>> +++ b/hw/vfio/common.c
>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>      return -ENODEV;
>>>>>>>  }
>>>>>>>  
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>>>> +{
>>>>>>> +    struct vfio_region_info *info = NULL;
>>>>>>> +    bool ret = false;
>>>>>>> +
>>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>>>> +            ret = true;
>>>>>>> +        }
>>>>>>> +        g_free(info);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>>>   */
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index c977ee3..27a3706 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>>>      off_t start, end;
>>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>>>  
>>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>>>> +                            vdev->msix->table_bar)) {
>>>>>>> +        return;
>>>>>>> +    }      
>>>>>>
>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>>>> msix table at 0x2000, using 64kB pages.
>>>>>>
>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>>>> mmaps[0]
>>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>>>
>>>>>>
>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>>>
>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>>>
>>>>>> In my case don't we still need the fixup logic?    
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events.  Thanks,    
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool
>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>>>              */
>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>> +           /*
>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>> +            * DMA mapped so skip them too.
>>>> +            */
>>>> +           (memory_region_is_ram(section->mr) &&
>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> +           );
>>>>  }
>>>>
>>>>
>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>> bigger?  
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices.  Is there a better, more generic solution to a) now?  
>>
>> "now?" what? :)
> 
> Now - adverb, at the present time or moment.  For example, vs in the
> past.

Ah, read incorrectly, I thought the sentence should have a) and b) and b)
was missing by a mistake. A heatwave here today.

> 
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal.   
>>
>>
>> mr->ram_device sound like an appropriate flag to check.
> 
> Yeah, I suppose we could, and that didn't exist when we added the
> bit 63 hack.

Ok, I'll make a patch for this then.


>>> It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
>>> thus the silly address hack, but we do have supported page sizes, so we
>>> should probably be referencing that rather than host page size.  
>>
>> Ah, correct,
>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/
> 
> No, that would be the emulated iommu page size, which may or may not
> (hopefully does) match the physical iommu page size.  The vfio iommu
> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.

Correct.





-- 
Alexey

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Auger Eric 43 weeks ago
Hi,

On 19/01/18 04:25, Alex Williamson wrote:
> On Fri, 19 Jan 2018 13:41:41 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/01/18 08:59, Alex Williamson wrote:
>>> On Tue, 16 Jan 2018 16:17:58 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 06/01/18 02:29, Alex Williamson wrote:  
>>>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>>>     
>>>>>> Hi Alexey,
>>>>>>
>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>>>
>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>>>> are emulated on top unless the machine requests not to by defining and
>>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>>>
>>>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>>>> the old behaviour with the sparse capability.
>>>>>>>
>>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>>>> emulated by QEMU as before.
>>>>>>>
>>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> This is mtree and flatview BEFORE this patch:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #0
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #2
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>>>> to enable emulation:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #1
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>> ---
>>>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>>>> index f3a2ac9..927d600 100644
>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>>>                           struct vfio_region_info **info);
>>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>>>  #endif
>>>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>>>  
>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>> index 4312e96..b45182e 100644
>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>>>  
>>>>>>> +/*
>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>>>> + */
>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>>>   *				    struct vfio_irq_info)
>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>>> index 9de63f0..693394a 100644
>>>>>>> --- a/hw/ppc/spapr.c
>>>>>>> +++ b/hw/ppc/spapr.c
>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>>>      spapr->use_hotplug_event_source = value;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>>>> +{
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>>>  {
>>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>>>      object_property_set_description(obj, "vsmt",
>>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>>>                                      " the host's SMT mode", &error_abort);
>>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>>>  }
>>>>>>>  
>>>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>>>  /*
>>>>>>>   * pseries-2.11
>>>>>>>   */
>>>>>>> +
>>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>>>  {
>>>>>>>  }
>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>> index 1fb8a8e..da5f182 100644
>>>>>>> --- a/hw/vfio/common.c
>>>>>>> +++ b/hw/vfio/common.c
>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>      return -ENODEV;
>>>>>>>  }
>>>>>>>  
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>>>> +{
>>>>>>> +    struct vfio_region_info *info = NULL;
>>>>>>> +    bool ret = false;
>>>>>>> +
>>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>>>> +            ret = true;
>>>>>>> +        }
>>>>>>> +        g_free(info);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>>>   */
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index c977ee3..27a3706 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>>>      off_t start, end;
>>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>>>  
>>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>>>> +                            vdev->msix->table_bar)) {
>>>>>>> +        return;
>>>>>>> +    }      
>>>>>>
>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>>>> msix table at 0x2000, using 64kB pages.
>>>>>>
>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>>>> mmaps[0]
>>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>>>
>>>>>>
>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>>>
>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>>>
>>>>>> In my case don't we still need the fixup logic?    
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events.  Thanks,    
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool
>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>>>              */
>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>> +           /*
>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>> +            * DMA mapped so skip them too.
>>>> +            */
>>>> +           (memory_region_is_ram(section->mr) &&
>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> +           );
>>>>  }
>>>>
>>>>
>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>> bigger?  
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices.  Is there a better, more generic solution to a) now?  
>>
>> "now?" what? :)
> 
> Now - adverb, at the present time or moment.  For example, vs in the
> past.
> 
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal.   
>>
>>
>> mr->ram_device sound like an appropriate flag to check.
> 
> Yeah, I suppose we could, and that didn't exist when we added the
> bit 63 hack.
> 
>>> It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
>>> thus the silly address hack, but we do have supported page sizes, so we
>>> should probably be referencing that rather than host page size.  
>>
>> Ah, correct,
>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/
> 
> No, that would be the emulated iommu page size, which may or may not
> (hopefully does) match the physical iommu page size.  The vfio iommu
> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.
> 
>>>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>>>> just does not register it with the KVM if unaligned.  
>>>
>>> For a sanity check, we don't support running a guest with a smaller
>>> page size than the host, right?  
>>
>> We do, at least on POWER.
>>
>>> I assume that would be the only case
>>> where we would have a reasonable chance of the guest using a host
>>> sub-page as a DMA target, but that seems like just one of an infinite
>>> number of problems with such a setup.  Thanks,  
>>
>> Like what? I am missing the point here. The most popular config on POWER
>> today is 64k system page size in both host and guest and 4K IOMMU page size
>> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.
> 
> The IOMMU supporting a smaller page size than the processor is the
> thing I think I was missing here, but it makes sense that they could,
> they're not necessarily linked.  Eric, will SMMU support 4K even while
> the CPU runs at 64K?

Yes that's my understanding. Also we introduced:

4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

Thanks

Eric

  The type1 backend has a number of PAGE_SIZE
> assumptions in it and will not expose IOMMU page sizes less than
> PAGE_SIZE.  Perhaps this is really the root of the problem, but it is
> advertising vfio_iommu_type1_info.iova_pgsizes, which userspace should
> honor.  So I think running (target page size) < (host page size) is
> probably broken for type1, for instance the guest could have a 4K RAM
> page as a DMA target that the 64K host couldn't map due to alignment,
> therefore DMA with that page would fault.  Thanks,
> 
> Alex
> 

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 42 weeks ago
On Fri, 19 Jan 2018 14:15:43 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi,
> 
> On 19/01/18 04:25, Alex Williamson wrote:
> > On Fri, 19 Jan 2018 13:41:41 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 19/01/18 08:59, Alex Williamson wrote:  
> >>> On Tue, 16 Jan 2018 16:17:58 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>     
> >>>> On 06/01/18 02:29, Alex Williamson wrote:    
> >>>>> On Fri, 5 Jan 2018 10:48:07 +0100
> >>>>> Auger Eric <eric.auger@redhat.com> wrote:
> >>>>>       
> >>>>>> Hi Alexey,
> >>>>>>
> >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:      
> >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >>>>>>>
> >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
> >>>>>>> are emulated on top unless the machine requests not to by defining and
> >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
> >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> >>>>>>> enabling it as it does not define the "set" callback for the new property;
> >>>>>>> the new property also does not appear in "-machine pseries,help".
> >>>>>>>
> >>>>>>> If the new capability is present, this puts MSIX IO memory region under
> >>>>>>> mapped memory region. If the capability is not there, it falls back to
> >>>>>>> the old behaviour with the sparse capability.
> >>>>>>>
> >>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
> >>>>>>> listener does not register it with the KVM as a memory slot and MSIX is
> >>>>>>> emulated by QEMU as before.
> >>>>>>>
> >>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>>>>>
> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> This is mtree and flatview BEFORE this patch:
> >>>>>>>
> >>>>>>> "info mtree":
> >>>>>>> memory-region: pci@800000020000000.mmio
> >>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>>
> >>>>>>> "info mtree -f":
> >>>>>>> FlatView #0
> >>>>>>>  AS "memory", root: system
> >>>>>>>  AS "cpu-memory", root: system
> >>>>>>>  Root memory region: system
> >>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> This is AFTER this patch applied:
> >>>>>>>
> >>>>>>> "info mtree":
> >>>>>>> memory-region: pci@800000020000000.mmio
> >>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>>
> >>>>>>>
> >>>>>>> "info mtree -f":
> >>>>>>> FlatView #2
> >>>>>>>  AS "memory", root: system
> >>>>>>>  AS "cpu-memory", root: system
> >>>>>>>  Root memory region: system
> >>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>>>>>> to enable emulation:
> >>>>>>>
> >>>>>>> "info mtree":
> >>>>>>> memory-region: pci@800000020000000.mmio
> >>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>>
> >>>>>>> "info mtree -f":
> >>>>>>> FlatView #1
> >>>>>>>  AS "memory", root: system
> >>>>>>>  AS "cpu-memory", root: system
> >>>>>>>  Root memory region: system
> >>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>>> ---
> >>>>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>>>>> index f3a2ac9..927d600 100644
> >>>>>>> --- a/include/hw/vfio/vfio-common.h
> >>>>>>> +++ b/include/hw/vfio/vfio-common.h
> >>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>>>>>                           struct vfio_region_info **info);
> >>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>>>>>  #endif
> >>>>>>>  extern const MemoryListener vfio_prereg_listener;
> >>>>>>>  
> >>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>>>> index 4312e96..b45182e 100644
> >>>>>>> --- a/linux-headers/linux/vfio.h
> >>>>>>> +++ b/linux-headers/linux/vfio.h
> >>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>>>>>>  
> >>>>>>> +/*
> >>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>>>>>> + */
> >>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> >>>>>>> +
> >>>>>>>  /**
> >>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>>>>>   *				    struct vfio_irq_info)
> >>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>>>> index 9de63f0..693394a 100644
> >>>>>>> --- a/hw/ppc/spapr.c
> >>>>>>> +++ b/hw/ppc/spapr.c
> >>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>>>>>      spapr->use_hotplug_event_source = value;
> >>>>>>>  }
> >>>>>>>  
> >>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>>>>>> +{
> >>>>>>> +    return true;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>>>>>  {
> >>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>>>>>      object_property_set_description(obj, "vsmt",
> >>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>>>>>                                      " the host's SMT mode", &error_abort);
> >>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>>>>>  }
> >>>>>>>  
> >>>>>>>  static void spapr_machine_finalizefn(Object *obj)
> >>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>>>>>  /*
> >>>>>>>   * pseries-2.11
> >>>>>>>   */
> >>>>>>> +
> >>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>>>>>  {
> >>>>>>>  }
> >>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>>>> index 1fb8a8e..da5f182 100644
> >>>>>>> --- a/hw/vfio/common.c
> >>>>>>> +++ b/hw/vfio/common.c
> >>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>>>      return -ENODEV;
> >>>>>>>  }
> >>>>>>>  
> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>>>>>> +{
> >>>>>>> +    struct vfio_region_info *info = NULL;
> >>>>>>> +    bool ret = false;
> >>>>>>> +
> >>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>>>>>> +            ret = true;
> >>>>>>> +        }
> >>>>>>> +        g_free(info);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    return ret;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>>  /*
> >>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>>>>>   */
> >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>>>> index c977ee3..27a3706 100644
> >>>>>>> --- a/hw/vfio/pci.c
> >>>>>>> +++ b/hw/vfio/pci.c
> >>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>>>>>      off_t start, end;
> >>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>>>>>  
> >>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>>>>>> +                            vdev->msix->table_bar)) {
> >>>>>>> +        return;
> >>>>>>> +    }        
> >>>>>>
> >>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> >>>>>> msix table at 0x2000, using 64kB pages.
> >>>>>>
> >>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
> >>>>>> mmaps[0]
> >>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>>>>>
> >>>>>>
> >>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> >>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> >>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> >>>>>> be mapped with 64kB page, leading to a qemu abort.
> >>>>>>
> >>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> >>>>>> 0xffffa8820000) = -22 (Invalid argument)
> >>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>>>>>
> >>>>>> In my case don't we still need the fixup logic?      
> >>>>>
> >>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
> >>>>> can't remove it until we figure out this bug and who should be filtering
> >>>>> out sub-page size listener events.  Thanks,      
> >>>>
> >>>>
> >>>> This should do it:
> >>>>
> >>>> @@ -303,11 +303,19 @@ static bool
> >>>> vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>>>              * addresses in the upper part of the 64-bit address space.  These
> >>>>              * are never accessed by the CPU and beyond the address width of
> >>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
> >>>>              */
> >>>> -           section->offset_within_address_space & (1ULL << 63);
> >>>> +           section->offset_within_address_space & (1ULL << 63) ||
> >>>> +           /*
> >>>> +            * Non-system-page aligned RAM regions have no chance to be
> >>>> +            * DMA mapped so skip them too.
> >>>> +            */
> >>>> +           (memory_region_is_ram(section->mr) &&
> >>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> >>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> >>>> +           );
> >>>>  }
> >>>>
> >>>>
> >>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> >>>> more complicated logic to detect whether it is MMIO? Or there is something
> >>>> bigger?    
> >>>
> >>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> >>> silly address hack above to try to exclude PCI BARs as they're being
> >>> resized, and b) there is some value to enabling peer-to-peer mappings
> >>> between devices.  Is there a better, more generic solution to a) now?    
> >>
> >> "now?" what? :)  
> > 
> > Now - adverb, at the present time or moment.  For example, vs in the
> > past.
> >   
> >>> If we know a MR is MMIO, then I think we can handle any associated
> >>> mapping failure as non-fatal.     
> >>
> >>
> >> mr->ram_device sound like an appropriate flag to check.  
> > 
> > Yeah, I suppose we could, and that didn't exist when we added the
> > bit 63 hack.
> >   
> >>> It's nice to enable p2p, but it's barely
> >>> used and not clear that it works reliably (sometimes even on bare
> >>> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
> >>> thus the silly address hack, but we do have supported page sizes, so we
> >>> should probably be referencing that rather than host page size.    
> >>
> >> Ah, correct,
> >> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/  
> > 
> > No, that would be the emulated iommu page size, which may or may not
> > (hopefully does) match the physical iommu page size.  The vfio iommu
> > page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.
> >   
> >>>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
> >>>> just does not register it with the KVM if unaligned.    
> >>>
> >>> For a sanity check, we don't support running a guest with a smaller
> >>> page size than the host, right?    
> >>
> >> We do, at least on POWER.
> >>  
> >>> I assume that would be the only case
> >>> where we would have a reasonable chance of the guest using a host
> >>> sub-page as a DMA target, but that seems like just one of an infinite
> >>> number of problems with such a setup.  Thanks,    
> >>
> >> Like what? I am missing the point here. The most popular config on POWER
> >> today is 64k system page size in both host and guest and 4K IOMMU page size
> >> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.  
> > 
> > The IOMMU supporting a smaller page size than the processor is the
> > thing I think I was missing here, but it makes sense that they could,
> > they're not necessarily linked.  Eric, will SMMU support 4K even while
> > the CPU runs at 64K?  
> 
> Yes that's my understanding. Also we introduced:
> 
> 4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

This is a bit of a different problem, for instance if PAGE_SIZE is 64K
but the IOMMU only supports 4K, type1 would previously report an empty
set in the pagesize bitmap.  With this change, we recognize that we can
compose a 64K mapping using 4K pages and therefore report 64K as a
valid page size.  It doesn't however report that we can create 4K
mappings.  VT-d sort or works the same way, but at the IOMMU API level
intel-iommu reports PAGE_MASK even though it only has native support
for a few of those discrete sizes (of course PAGE_SIZE is always 4K, so
the similarity ends there).  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Auger Eric 42 weeks ago
Hi Alex,

On 23/01/18 20:56, Alex Williamson wrote:
> On Fri, 19 Jan 2018 14:15:43 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi,
>>
>> On 19/01/18 04:25, Alex Williamson wrote:
>>> On Fri, 19 Jan 2018 13:41:41 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 19/01/18 08:59, Alex Williamson wrote:  
>>>>> On Tue, 16 Jan 2018 16:17:58 +1100
>>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>     
>>>>>> On 06/01/18 02:29, Alex Williamson wrote:    
>>>>>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>>>>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>>>>>       
>>>>>>>> Hi Alexey,
>>>>>>>>
>>>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:      
>>>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>>>>>
>>>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>>>>>> are emulated on top unless the machine requests not to by defining and
>>>>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>>>>>
>>>>>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>>>>>> the old behaviour with the sparse capability.
>>>>>>>>>
>>>>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>>>>>> emulated by QEMU as before.
>>>>>>>>>
>>>>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> This is mtree and flatview BEFORE this patch:
>>>>>>>>>
>>>>>>>>> "info mtree":
>>>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>>
>>>>>>>>> "info mtree -f":
>>>>>>>>> FlatView #0
>>>>>>>>>  AS "memory", root: system
>>>>>>>>>  AS "cpu-memory", root: system
>>>>>>>>>  Root memory region: system
>>>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is AFTER this patch applied:
>>>>>>>>>
>>>>>>>>> "info mtree":
>>>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "info mtree -f":
>>>>>>>>> FlatView #2
>>>>>>>>>  AS "memory", root: system
>>>>>>>>>  AS "cpu-memory", root: system
>>>>>>>>>  Root memory region: system
>>>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>>>>>> to enable emulation:
>>>>>>>>>
>>>>>>>>> "info mtree":
>>>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>>
>>>>>>>>> "info mtree -f":
>>>>>>>>> FlatView #1
>>>>>>>>>  AS "memory", root: system
>>>>>>>>>  AS "cpu-memory", root: system
>>>>>>>>>  Root memory region: system
>>>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>>> ---
>>>>>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>>>>>> index f3a2ac9..927d600 100644
>>>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>>>>>                           struct vfio_region_info **info);
>>>>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>>>>>  #endif
>>>>>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>>>>>  
>>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>>>> index 4312e96..b45182e 100644
>>>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>>>>>  
>>>>>>>>> +/*
>>>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>>>>>> + */
>>>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>>>>>> +
>>>>>>>>>  /**
>>>>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>>>>>   *				    struct vfio_irq_info)
>>>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>>>>> index 9de63f0..693394a 100644
>>>>>>>>> --- a/hw/ppc/spapr.c
>>>>>>>>> +++ b/hw/ppc/spapr.c
>>>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>>>>>      spapr->use_hotplug_event_source = value;
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>>>>>> +{
>>>>>>>>> +    return true;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>>>>>  {
>>>>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>>>>>      object_property_set_description(obj, "vsmt",
>>>>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>>>>>                                      " the host's SMT mode", &error_abort);
>>>>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>>>>>  /*
>>>>>>>>>   * pseries-2.11
>>>>>>>>>   */
>>>>>>>>> +
>>>>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>>>>>  {
>>>>>>>>>  }
>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>>> index 1fb8a8e..da5f182 100644
>>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>>>      return -ENODEV;
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>>>>>> +{
>>>>>>>>> +    struct vfio_region_info *info = NULL;
>>>>>>>>> +    bool ret = false;
>>>>>>>>> +
>>>>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>>>>>> +            ret = true;
>>>>>>>>> +        }
>>>>>>>>> +        g_free(info);
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return ret;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  /*
>>>>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>>>>>   */
>>>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>>>> index c977ee3..27a3706 100644
>>>>>>>>> --- a/hw/vfio/pci.c
>>>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>>>>>      off_t start, end;
>>>>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>>>>>  
>>>>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>>>>>> +                            vdev->msix->table_bar)) {
>>>>>>>>> +        return;
>>>>>>>>> +    }        
>>>>>>>>
>>>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>>>>>> msix table at 0x2000, using 64kB pages.
>>>>>>>>
>>>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>>>>>> mmaps[0]
>>>>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>>>>>
>>>>>>>>
>>>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>>>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>>>>>
>>>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>>>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>>>>>
>>>>>>>> In my case don't we still need the fixup logic?      
>>>>>>>
>>>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>>>> out sub-page size listener events.  Thanks,      
>>>>>>
>>>>>>
>>>>>> This should do it:
>>>>>>
>>>>>> @@ -303,11 +303,19 @@ static bool
>>>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>>>              * are never accessed by the CPU and beyond the address width of
>>>>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>>>>>              */
>>>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>>>> +           /*
>>>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>>>> +            * DMA mapped so skip them too.
>>>>>> +            */
>>>>>> +           (memory_region_is_ram(section->mr) &&
>>>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>>>> +           );
>>>>>>  }
>>>>>>
>>>>>>
>>>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>>>> bigger?    
>>>>>
>>>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>>>> silly address hack above to try to exclude PCI BARs as they're being
>>>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>>>> between devices.  Is there a better, more generic solution to a) now?    
>>>>
>>>> "now?" what? :)  
>>>
>>> Now - adverb, at the present time or moment.  For example, vs in the
>>> past.
>>>   
>>>>> If we know a MR is MMIO, then I think we can handle any associated
>>>>> mapping failure as non-fatal.     
>>>>
>>>>
>>>> mr->ram_device sound like an appropriate flag to check.  
>>>
>>> Yeah, I suppose we could, and that didn't exist when we added the
>>> bit 63 hack.
>>>   
>>>>> It's nice to enable p2p, but it's barely
>>>>> used and not clear that it works reliably (sometimes even on bare
>>>>> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed,
>>>>> thus the silly address hack, but we do have supported page sizes, so we
>>>>> should probably be referencing that rather than host page size.    
>>>>
>>>> Ah, correct,
>>>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/  
>>>
>>> No, that would be the emulated iommu page size, which may or may not
>>> (hopefully does) match the physical iommu page size.  The vfio iommu
>>> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.
>>>   
>>>>>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>>>>>> just does not register it with the KVM if unaligned.    
>>>>>
>>>>> For a sanity check, we don't support running a guest with a smaller
>>>>> page size than the host, right?    
>>>>
>>>> We do, at least on POWER.
>>>>  
>>>>> I assume that would be the only case
>>>>> where we would have a reasonable chance of the guest using a host
>>>>> sub-page as a DMA target, but that seems like just one of an infinite
>>>>> number of problems with such a setup.  Thanks,    
>>>>
>>>> Like what? I am missing the point here. The most popular config on POWER
>>>> today is 64k system page size in both host and guest and 4K IOMMU page size
>>>> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.  
>>>
>>> The IOMMU supporting a smaller page size than the processor is the
>>> thing I think I was missing here, but it makes sense that they could,
>>> they're not necessarily linked.  Eric, will SMMU support 4K even while
>>> the CPU runs at 64K?  
>>
>> Yes that's my understanding.

Hum actually I was wrong. in arm-smmu-v3 driver there is a first pgsize
bitmap initialization at probe time, made by probing the HW capabilities
(looking at ID5 reg).

But on domain finalization, when the page allocator is allocated
(alloc_io_pgtable_ops), this bitmap (cfg->pgsize_bitmap in the code
below) is overriden and if PAGE_SIZE is supported and in our case
granule=64K, only 64K and 512M page sizes are set. So domain pgsize
bitmap is not exposing 4K. see arm_lpae_restrict_pgsizes in
driver/iommu/io-pgtable-arm.c


        /*
         * We need to restrict the supported page sizes to match the
         * translation regime for a particular granule. Aim to match
         * the CPU page size if possible, otherwise prefer smaller sizes.
         * While we're at it, restrict the block sizes to match the
         * chosen granule.
         */

        if (cfg->pgsize_bitmap & PAGE_SIZE)
                granule = PAGE_SIZE;
        else if (cfg->pgsize_bitmap & ~PAGE_MASK)
                granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
        else if (cfg->pgsize_bitmap & PAGE_MASK)
                granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
        else
                granule = 0;

        switch (granule) {
        case SZ_4K:
                cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
                break;
        case SZ_16K:
                cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
                break;
        case SZ_64K:
                cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
                break;
        default:
                cfg->pgsize_bitmap = 0;


 Also we introduced:
>>
>> 4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
> 
> This is a bit of a different problem, for instance if PAGE_SIZE is 64K
> but the IOMMU only supports 4K, type1 would previously report an empty
> set in the pagesize bitmap.  With this change, we recognize that we can
> compose a 64K mapping using 4K pages and therefore report 64K as a
> valid page size.  It doesn't however report that we can create 4K
> mappings.
I agree. With the above code, if PAGE_SIZE is not supported by the smmu,
domain bitmap exposes 4K page size and then vfio iommu type1 exposes 64K.

Thanks

Eric
  VT-d sort or works the same way, but at the IOMMU API level
> intel-iommu reports PAGE_MASK even though it only has native support
> for a few of those discrete sizes (of course PAGE_SIZE is always 4K, so
> the similarity ends there).  Thanks,
> 
> Alex
> 

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 43 weeks ago
On 19/01/18 08:59, Alex Williamson wrote:
> On Tue, 16 Jan 2018 16:17:58 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 06/01/18 02:29, Alex Williamson wrote:
>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>   
>>>> Hi Alexey,
>>>>
>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:  
>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>
>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>> are emulated on top unless the machine requests not to by defining and
>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>
>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>> the old behaviour with the sparse capability.
>>>>>
>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>> emulated by QEMU as before.
>>>>>
>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> This is mtree and flatview BEFORE this patch:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #0
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>>
>>>>> This is AFTER this patch applied:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #2
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>>
>>>>>
>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>> to enable emulation:
>>>>>
>>>>> "info mtree":
>>>>> memory-region: pci@800000020000000.mmio
>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>
>>>>> "info mtree -f":
>>>>> FlatView #1
>>>>>  AS "memory", root: system
>>>>>  AS "cpu-memory", root: system
>>>>>  Root memory region: system
>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>> ---
>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>> index f3a2ac9..927d600 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>                           struct vfio_region_info **info);
>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>  #endif
>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>  
>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>> index 4312e96..b45182e 100644
>>>>> --- a/linux-headers/linux/vfio.h
>>>>> +++ b/linux-headers/linux/vfio.h
>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>  
>>>>> +/*
>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>> + */
>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>> +
>>>>>  /**
>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>   *				    struct vfio_irq_info)
>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>> index 9de63f0..693394a 100644
>>>>> --- a/hw/ppc/spapr.c
>>>>> +++ b/hw/ppc/spapr.c
>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>      spapr->use_hotplug_event_source = value;
>>>>>  }
>>>>>  
>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>> +{
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>  {
>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>      object_property_set_description(obj, "vsmt",
>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>                                      " the host's SMT mode", &error_abort);
>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>  }
>>>>>  
>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>  /*
>>>>>   * pseries-2.11
>>>>>   */
>>>>> +
>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>  {
>>>>>  }
>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>> index 1fb8a8e..da5f182 100644
>>>>> --- a/hw/vfio/common.c
>>>>> +++ b/hw/vfio/common.c
>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>      return -ENODEV;
>>>>>  }
>>>>>  
>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>> +{
>>>>> +    struct vfio_region_info *info = NULL;
>>>>> +    bool ret = false;
>>>>> +
>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>> +            ret = true;
>>>>> +        }
>>>>> +        g_free(info);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>   */
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index c977ee3..27a3706 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>      off_t start, end;
>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>  
>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>> +                            vdev->msix->table_bar)) {
>>>>> +        return;
>>>>> +    }    
>>>>
>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>> msix table at 0x2000, using 64kB pages.
>>>>
>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>> mmaps[0]
>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>
>>>>
>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>
>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>
>>>> In my case don't we still need the fixup logic?  
>>>
>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>> can't remove it until we figure out this bug and who should be filtering
>>> out sub-page size listener events.  Thanks,  
>>
>>
>> This should do it:
>>
>> @@ -303,11 +303,19 @@ static bool
>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>              * addresses in the upper part of the 64-bit address space.  These
>>              * are never accessed by the CPU and beyond the address width of
>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>              */
>> -           section->offset_within_address_space & (1ULL << 63);
>> +           section->offset_within_address_space & (1ULL << 63) ||
>> +           /*
>> +            * Non-system-page aligned RAM regions have no chance to be
>> +            * DMA mapped so skip them too.
>> +            */
>> +           (memory_region_is_ram(section->mr) &&
>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>> +           );
>>  }
>>
>>
>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>> more complicated logic to detect whether it is MMIO? Or there is something
>> bigger?
> 
> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> silly address hack above to try to exclude PCI BARs as they're being
> resized, and b) there is some value to enabling peer-to-peer mappings
> between devices.  Is there a better, more generic solution to a) now?
> If we know a MR is MMIO, then I think we can handle any associated
> mapping failure as non-fatal.  It's nice to enable p2p, but it's barely
> used and not clear that it works reliably (sometimes even on bare
> metal). 

When P2P works in a guest, what does it look like exactly? Do the devices
talk via the IOMMU or directly? I'd assume the latter but then the guest
device driver needs to know physical MMIO addresses and it does not or
guest must program BARs exactly as the host did... Has anyone actually
reported this working?

I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
windows to PHB address space") and ended up having
vfio_listener_region_add() called with RAM regions (which are mapped MMIO)
and sPAPR cannot handle it otherwise than creating a window per a memory
section and since we can have just two, this fails. Adding
memory_region_is_ram_device() to vfio_listener_skipped_section() solves my
particular problem but I wonder if I should rather enable this P2P on sPAPR?




> Nit, we don't yet have the IOVA range of the IOMMU exposed,
> thus the silly address hack, but we do have supported page sizes, so we
> should probably be referencing that rather than host page size.
>  
>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>> just does not register it with the KVM if unaligned.
> 
> For a sanity check, we don't support running a guest with a smaller
> page size than the host, right?  I assume that would be the only case
> where we would have a reasonable chance of the guest using a host
> sub-page as a DMA target, but that seems like just one of an infinite
> number of problems with such a setup.  Thanks,
> 
> Alex
> 


-- 
Alexey

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alex Williamson 43 weeks ago
On Fri, 19 Jan 2018 19:55:49 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/01/18 08:59, Alex Williamson wrote:
> > On Tue, 16 Jan 2018 16:17:58 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 06/01/18 02:29, Alex Williamson wrote:  
> >>> On Fri, 5 Jan 2018 10:48:07 +0100
> >>> Auger Eric <eric.auger@redhat.com> wrote:
> >>>     
> >>>> Hi Alexey,
> >>>>
> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
> >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> >>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
> >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >>>>>
> >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
> >>>>> are emulated on top unless the machine requests not to by defining and
> >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
> >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> >>>>> enabling it as it does not define the "set" callback for the new property;
> >>>>> the new property also does not appear in "-machine pseries,help".
> >>>>>
> >>>>> If the new capability is present, this puts MSIX IO memory region under
> >>>>> mapped memory region. If the capability is not there, it falls back to
> >>>>> the old behaviour with the sparse capability.
> >>>>>
> >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
> >>>>> listener does not register it with the KVM as a memory slot and MSIX is
> >>>>> emulated by QEMU as before.
> >>>>>
> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
> >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
> >>>>>
> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>>> ---
> >>>>>
> >>>>> This is mtree and flatview BEFORE this patch:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #0
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
> >>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #2
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>>
> >>>>>
> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> >>>>> to enable emulation:
> >>>>>
> >>>>> "info mtree":
> >>>>> memory-region: pci@800000020000000.mmio
> >>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
> >>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
> >>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
> >>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
> >>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>>
> >>>>> "info mtree -f":
> >>>>> FlatView #1
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
> >>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
> >>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>>>> ---
> >>>>>  include/hw/vfio/vfio-common.h |  1 +
> >>>>>  linux-headers/linux/vfio.h    |  5 +++++
> >>>>>  hw/ppc/spapr.c                |  8 ++++++++
> >>>>>  hw/vfio/common.c              | 15 +++++++++++++++
> >>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
> >>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>>> index f3a2ac9..927d600 100644
> >>>>> --- a/include/hw/vfio/vfio-common.h
> >>>>> +++ b/include/hw/vfio/vfio-common.h
> >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
> >>>>>                           struct vfio_region_info **info);
> >>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>                               uint32_t subtype, struct vfio_region_info **info);
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
> >>>>>  #endif
> >>>>>  extern const MemoryListener vfio_prereg_listener;
> >>>>>  
> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >>>>> index 4312e96..b45182e 100644
> >>>>> --- a/linux-headers/linux/vfio.h
> >>>>> +++ b/linux-headers/linux/vfio.h
> >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
> >>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
> >>>>>  
> >>>>> +/*
> >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> >>>>> + */
> >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
> >>>>> +
> >>>>>  /**
> >>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> >>>>>   *				    struct vfio_irq_info)
> >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>>> index 9de63f0..693394a 100644
> >>>>> --- a/hw/ppc/spapr.c
> >>>>> +++ b/hw/ppc/spapr.c
> >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >>>>>      spapr->use_hotplug_event_source = value;
> >>>>>  }
> >>>>>  
> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> >>>>> +{
> >>>>> +    return true;
> >>>>> +}
> >>>>> +
> >>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> >>>>>  {
> >>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
> >>>>>      object_property_set_description(obj, "vsmt",
> >>>>>                                      "Virtual SMT: KVM behaves as if this were"
> >>>>>                                      " the host's SMT mode", &error_abort);
> >>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>>>> +                             spapr_get_msix_emulation, NULL, NULL);
> >>>>>  }
> >>>>>  
> >>>>>  static void spapr_machine_finalizefn(Object *obj)
> >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >>>>>  /*
> >>>>>   * pseries-2.11
> >>>>>   */
> >>>>> +
> >>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>>>> index 1fb8a8e..da5f182 100644
> >>>>> --- a/hw/vfio/common.c
> >>>>> +++ b/hw/vfio/common.c
> >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >>>>>      return -ENODEV;
> >>>>>  }
> >>>>>  
> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> >>>>> +{
> >>>>> +    struct vfio_region_info *info = NULL;
> >>>>> +    bool ret = false;
> >>>>> +
> >>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> >>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
> >>>>> +            ret = true;
> >>>>> +        }
> >>>>> +        g_free(info);
> >>>>> +    }
> >>>>> +
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
> >>>>>   */
> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>>>> index c977ee3..27a3706 100644
> >>>>> --- a/hw/vfio/pci.c
> >>>>> +++ b/hw/vfio/pci.c
> >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>>>      off_t start, end;
> >>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >>>>>  
> >>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> >>>>> +                            vdev->msix->table_bar)) {
> >>>>> +        return;
> >>>>> +    }      
> >>>>
> >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> >>>> msix table at 0x2000, using 64kB pages.
> >>>>
> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
> >>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
> >>>> mmaps[0]
> >>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
> >>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
> >>>>
> >>>>
> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> >>>> be mapped with 64kB page, leading to a qemu abort.
> >>>>
> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
> >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> >>>> 0xffffa8820000) = -22 (Invalid argument)
> >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
> >>>>
> >>>> In my case don't we still need the fixup logic?    
> >>>
> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
> >>> can't remove it until we figure out this bug and who should be filtering
> >>> out sub-page size listener events.  Thanks,    
> >>
> >>
> >> This should do it:
> >>
> >> @@ -303,11 +303,19 @@ static bool
> >> vfio_listener_skipped_section(MemoryRegionSection *section)
> >>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>              * addresses in the upper part of the 64-bit address space.  These
> >>              * are never accessed by the CPU and beyond the address width of
> >>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
> >>              */
> >> -           section->offset_within_address_space & (1ULL << 63);
> >> +           section->offset_within_address_space & (1ULL << 63) ||
> >> +           /*
> >> +            * Non-system-page aligned RAM regions have no chance to be
> >> +            * DMA mapped so skip them too.
> >> +            */
> >> +           (memory_region_is_ram(section->mr) &&
> >> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> >> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> >> +           );
> >>  }
> >>
> >>
> >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> >> more complicated logic to detect whether it is MMIO? Or there is something
> >> bigger?  
> > 
> > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> > silly address hack above to try to exclude PCI BARs as they're being
> > resized, and b) there is some value to enabling peer-to-peer mappings
> > between devices.  Is there a better, more generic solution to a) now?
> > If we know a MR is MMIO, then I think we can handle any associated
> > mapping failure as non-fatal.  It's nice to enable p2p, but it's barely
> > used and not clear that it works reliably (sometimes even on bare
> > metal).   
> 
> When P2P works in a guest, what does it look like exactly? Do the devices
> talk via the IOMMU or directly? I'd assume the latter but then the guest
> device driver needs to know physical MMIO addresses and it does not or
> guest must program BARs exactly as the host did... Has anyone actually
> reported this working?

I've seen it work, NVIDIA has a peer to peer test in their cuda tests,
see previous patches enabling GPUDirect.  The route is generally
through the IOMMU, guests don't know host physical addresses.  I say
generally though because ACS does have a Directed Translated P2P
feature that would allow a more efficient path, but I've never seen
anything that can take advantage of this.
 
> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
> windows to PHB address space") and ended up having
> vfio_listener_region_add() called with RAM regions (which are mapped MMIO)
> and sPAPR cannot handle it otherwise than creating a window per a memory
> section and since we can have just two, this fails. Adding
> memory_region_is_ram_device() to vfio_listener_skipped_section() solves my
> particular problem but I wonder if I should rather enable this P2P on sPAPR?

Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO)
mappings non-fatal.  I think there's probably enough skepticism in
drivers as to whether p2p works that this isn't as critical.  Of course
making it work is even better.  Thanks,

Alex

Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR

Posted by Alexey Kardashevskiy 43 weeks ago
On 20/01/18 02:57, Alex Williamson wrote:
> On Fri, 19 Jan 2018 19:55:49 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/01/18 08:59, Alex Williamson wrote:
>>> On Tue, 16 Jan 2018 16:17:58 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 06/01/18 02:29, Alex Williamson wrote:  
>>>>> On Fri, 5 Jan 2018 10:48:07 +0100
>>>>> Auger Eric <eric.auger@redhat.com> wrote:
>>>>>     
>>>>>> Hi Alexey,
>>>>>>
>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:    
>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>>>>>
>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>>>>>> are emulated on top unless the machine requests not to by defining and
>>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>>>>>> enabling it as it does not define the "set" callback for the new property;
>>>>>>> the new property also does not appear in "-machine pseries,help".
>>>>>>>
>>>>>>> If the new capability is present, this puts MSIX IO memory region under
>>>>>>> mapped memory region. If the capability is not there, it falls back to
>>>>>>> the old behaviour with the sparse capability.
>>>>>>>
>>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory
>>>>>>> listener does not register it with the KVM as a memory slot and MSIX is
>>>>>>> emulated by QEMU as before.
>>>>>>>
>>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> This is mtree and flatview BEFORE this patch:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #0
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #2
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
>>>>>>> to enable emulation:
>>>>>>>
>>>>>>> "info mtree":
>>>>>>> memory-region: pci@800000020000000.mmio
>>>>>>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>>>>>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>>>>>>>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>         000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>         000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>>>>>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>>>>>>>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>>
>>>>>>> "info mtree -f":
>>>>>>> FlatView #1
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>>>>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>>>>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>>>>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>>>> ---
>>>>>>>  include/hw/vfio/vfio-common.h |  1 +
>>>>>>>  linux-headers/linux/vfio.h    |  5 +++++
>>>>>>>  hw/ppc/spapr.c                |  8 ++++++++
>>>>>>>  hw/vfio/common.c              | 15 +++++++++++++++
>>>>>>>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>>>>>>>  5 files changed, 50 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>>>> index f3a2ac9..927d600 100644
>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>>>>>>>                           struct vfio_region_info **info);
>>>>>>>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>                               uint32_t subtype, struct vfio_region_info **info);
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>>>>>>>  #endif
>>>>>>>  extern const MemoryListener vfio_prereg_listener;
>>>>>>>  
>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>>>>>>> index 4312e96..b45182e 100644
>>>>>>> --- a/linux-headers/linux/vfio.h
>>>>>>> +++ b/linux-headers/linux/vfio.h
>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
>>>>>>>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
>>>>>>>  
>>>>>>> +/*
>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
>>>>>>> + */
>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE	3
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>>>>>>>   *				    struct vfio_irq_info)
>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>>> index 9de63f0..693394a 100644
>>>>>>> --- a/hw/ppc/spapr.c
>>>>>>> +++ b/hw/ppc/spapr.c
>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>>>>>>>      spapr->use_hotplug_event_source = value;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
>>>>>>> +{
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>>>>>>>  {
>>>>>>>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>>>>>>>      object_property_set_description(obj, "vsmt",
>>>>>>>                                      "Virtual SMT: KVM behaves as if this were"
>>>>>>>                                      " the host's SMT mode", &error_abort);
>>>>>>> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>>>> +                             spapr_get_msix_emulation, NULL, NULL);
>>>>>>>  }
>>>>>>>  
>>>>>>>  static void spapr_machine_finalizefn(Object *obj)
>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>>>>>>>  /*
>>>>>>>   * pseries-2.11
>>>>>>>   */
>>>>>>> +
>>>>>>>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>>>>>>>  {
>>>>>>>  }
>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>> index 1fb8a8e..da5f182 100644
>>>>>>> --- a/hw/vfio/common.c
>>>>>>> +++ b/hw/vfio/common.c
>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>>>>>>>      return -ENODEV;
>>>>>>>  }
>>>>>>>  
>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
>>>>>>> +{
>>>>>>> +    struct vfio_region_info *info = NULL;
>>>>>>> +    bool ret = false;
>>>>>>> +
>>>>>>> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
>>>>>>> +        if (vfio_get_region_info_cap(info, cap_type)) {
>>>>>>> +            ret = true;
>>>>>>> +        }
>>>>>>> +        g_free(info);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * Interfaces for IBM EEH (Enhanced Error Handling)
>>>>>>>   */
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index c977ee3..27a3706 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>>>      off_t start, end;
>>>>>>>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>>>>>>>  
>>>>>>> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
>>>>>>> +                            vdev->msix->table_bar)) {
>>>>>>> +        return;
>>>>>>> +    }      
>>>>>>
>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
>>>>>> msix table at 0x2000, using 64kB pages.
>>>>>>
>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>>>>>>  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0
>>>>>> mmaps[0]
>>>>>>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>>>>>>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>>>>>>
>>>>>>
>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>>>>>> be mapped with 64kB page, leading to a qemu abort.
>>>>>>
>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>>>>>> 0xffffa8820000) = -22 (Invalid argument)
>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>>>>>
>>>>>> In my case don't we still need the fixup logic?    
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this.  I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events.  Thanks,    
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool
>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space.  These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>>              * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
>>>>              */
>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>> +           /*
>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>> +            * DMA mapped so skip them too.
>>>> +            */
>>>> +           (memory_region_is_ram(section->mr) &&
>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> +           );
>>>>  }
>>>>
>>>>
>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>> bigger?  
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices.  Is there a better, more generic solution to a) now?
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal.  It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal).   
>>
>> When P2P works in a guest, what does it look like exactly? Do the devices
>> talk via the IOMMU or directly? I'd assume the latter but then the guest
>> device driver needs to know physical MMIO addresses and it does not or
>> guest must program BARs exactly as the host did... Has anyone actually
>> reported this working?
> 
> I've seen it work, NVIDIA has a peer to peer test in their cuda tests,
> see previous patches enabling GPUDirect.  The route is generally
> through the IOMMU, guests don't know host physical addresses.  I say
> generally though because ACS does have a Directed Translated P2P
> feature that would allow a more efficient path, but I've never seen
> anything that can take advantage of this.

Ah, I see. I guess it still makes sense with IOMMU as it eliminates an
extra memory copy.

>  
>> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
>> windows to PHB address space") and ended up having
>> vfio_listener_region_add() called with RAM regions (which are mapped MMIO)
>> and sPAPR cannot handle it otherwise than creating a window per a memory
>> section and since we can have just two, this fails. Adding
>> memory_region_is_ram_device() to vfio_listener_skipped_section() solves my
>> particular problem but I wonder if I should rather enable this P2P on sPAPR?
> 
> Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO)
> mappings non-fatal.  I think there's probably enough skepticism in
> drivers as to whether p2p works that this isn't as critical.  Of course
> making it work is even better.  Thanks,

Well, to make P2P-via-IOMMU work on sPAPR (assuming that our IOMMU can
actually redirect back to the bus which I am not sure about), these MMIOs
need to me mapped via the huge DMA window by the guest and not by QEMU-VFIO
so sPAPR will need another machine option ("no-mmio-dma-mmap"? implemented
like "vfio-no-msix-emulation" proposed by this patch) to tell
vfio_listener_region_add() not to vfio_dma_map on MMIO MRs,
vfio_listener_region_del() will need adjustment too. Would that be
considered as an option or there is a better interface?



-- 
Alexey