[PATCH v4] x86/msi: fix locking for SR-IOV devices

Stewart Hildebrand posted 1 patch 2 months, 3 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20240827035929.118003-1-stewart.hildebrand@amd.com
xen/arch/x86/msi.c            | 33 +++++++++++----------
xen/drivers/passthrough/pci.c | 55 ++++++++++++++++++++++++++++++++---
xen/include/xen/pci.h         | 13 ++++++++-
3 files changed, 80 insertions(+), 21 deletions(-)
[PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Stewart Hildebrand 2 months, 3 weeks ago
In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
structure") a lock moved from allocate_and_map_msi_pirq() to the caller
and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
call path wasn't updated to reflect the change, leading to a failed
assertion observed under the following conditions:

* PV dom0
* Debug build (debug=y) of Xen
* There is an SR-IOV device in the system with one or more VFs enabled
* Dom0 has loaded the driver for the VF and enabled MSI-X

(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
(XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
(XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
(XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
(XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
(XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
(XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
(XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
(XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
(XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
(XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
associated PF to access the vf_rlen array. This array is initialized in
pci_add_device() and is only populated in the associated PF's struct
pci_dev.

Add links between the VF's struct pci_dev and associated PF struct
pci_dev, ensuring the PF's struct doesn't get deallocated until all its
VFs have been removed. Access the vf_rlen array via the new link to the
PF, and remove the troublesome call to pci_get_pdev().

Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
pci_add_device() to set up the link from VF to PF. In case the new
pci_get_pdev() invocation fails to find the associated PF (returning
NULL), return an error.

In pci_remove_device(), handle an issue with a corner case when a PF is
removed with VFs enabled, then re-added with VFs disabled. During PF
removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
dangling VFs. At least Linux warns in this case:

[  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove

If the PF is subsequently re-added with VFs disabled, the previously
associated VFs in Xen would not be backed by a device. Any attempt to
access the config space of the stale VF would be invalid. Avoid issues
in this case by removing the VFs right away when removing the PF. This
also happens to simplify the maintenance of the PF<->VF links since the
associated PF will always exist during the lifetime of a VF. Note,
however, if a PF is removed before the VFs, Xen will return an error
for the VF removals since they were already removed.

Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Candidate for backport to 4.19

v3->v4:
* handle case when PF is removed with VFs enabled, then re-added with
  VFs disabled

v2->v3:
* link from VF to PF's struct pci_dev *

v1->v2:
* remove call to pci_get_pdev()
---
When I tested removing a PF with VFs enabled, then re-adding with VFs
disabled, I observed the following Xen crash when dom0 attempted to
access the config space of a stale VF:

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
(XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
(XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
(XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
(XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
(XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
(XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
(XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
(XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
(XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c

This crash is avoided by removing the VFs right away with the PF.
---
 xen/arch/x86/msi.c            | 33 +++++++++++----------
 xen/drivers/passthrough/pci.c | 55 ++++++++++++++++++++++++++++++++---
 xen/include/xen/pci.h         | 13 ++++++++-
 3 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 0c97fbb3fc03..8d54f268dbbf 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -662,34 +662,31 @@ static int msi_capability_init(struct pci_dev *dev,
     return 0;
 }
 
-static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
+static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf,
+                            const struct pf_info *pf_info)
 {
+    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
     u8 limit;
     u32 addr, base = PCI_BASE_ADDRESS_0;
     u64 disp = 0;
 
     if ( vf >= 0 )
     {
-        struct pci_dev *pdev = pci_get_pdev(NULL,
-                                            PCI_SBDF(seg, bus, slot, func));
         unsigned int pos;
         uint16_t ctrl, num_vf, offset, stride;
 
-        if ( !pdev )
-            return 0;
-
-        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
-        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
-        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
-        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
-        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
+        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
+        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
+        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
+        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
+        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
 
         if ( !pos ||
              !(ctrl & PCI_SRIOV_CTRL_VFE) ||
              !(ctrl & PCI_SRIOV_CTRL_MSE) ||
              !num_vf || !offset || (num_vf > 1 && !stride) ||
              bir >= PCI_SRIOV_NUM_BARS ||
-             !pdev->vf_rlen[bir] )
+             !pf_info->vf_rlen[bir] )
             return 0;
         base = pos + PCI_SRIOV_BAR;
         vf -= PCI_BDF(bus, slot, func) + offset;
@@ -703,8 +700,8 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
         }
         if ( vf >= num_vf )
             return 0;
-        BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-        disp = vf * pdev->vf_rlen[bir];
+        BUILD_BUG_ON(ARRAY_SIZE(pf_info->vf_rlen) != PCI_SRIOV_NUM_BARS);
+        disp = vf * pf_info->vf_rlen[bir];
         limit = PCI_SRIOV_NUM_BARS;
     }
     else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
@@ -813,6 +810,7 @@ static int msix_capability_init(struct pci_dev *dev,
         int vf;
         paddr_t pba_paddr;
         unsigned int pba_offset;
+        const struct pf_info *pf_info;
 
         if ( !dev->info.is_virtfn )
         {
@@ -820,6 +818,7 @@ static int msix_capability_init(struct pci_dev *dev,
             pslot = slot;
             pfunc = func;
             vf = -1;
+            pf_info = NULL;
         }
         else
         {
@@ -827,9 +826,11 @@ static int msix_capability_init(struct pci_dev *dev,
             pslot = PCI_SLOT(dev->info.physfn.devfn);
             pfunc = PCI_FUNC(dev->info.physfn.devfn);
             vf = dev->sbdf.bdf;
+            pf_info = &dev->virtfn.pf_pdev->physfn;
         }
 
-        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf,
+                                       pf_info);
         WARN_ON(msi && msi->table_base != table_paddr);
         if ( !table_paddr )
         {
@@ -852,7 +853,7 @@ static int msix_capability_init(struct pci_dev *dev,
 
         pba_offset = pci_conf_read32(dev->sbdf, msix_pba_offset_reg(pos));
         bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
-        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf, pf_info);
         WARN_ON(!pba_paddr);
         pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 5a446d3dcee0..97b3417252f7 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
 
     list_add(&pdev->alldevs_list, &pseg->alldevs_list);
 
+    INIT_LIST_HEAD(&pdev->physfn.vf_list);
+
     /* update bus2bridge */
     switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
     {
@@ -446,6 +448,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
 
     list_del(&pdev->alldevs_list);
     pdev_msi_deinit(pdev);
+
+    if ( pdev->info.is_virtfn )
+        list_del(&pdev->virtfn.entry);
+
     xfree(pdev);
 }
 
@@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * extended function.
          */
         if ( pdev->info.is_virtfn )
+        {
+            struct pci_dev *pf_pdev;
+
             pdev->info.is_extfn = pf_is_extfn;
+            pf_pdev = pci_get_pdev(NULL,
+                                   PCI_SBDF(seg, info->physfn.bus,
+                                            info->physfn.devfn));
+            if ( pf_pdev )
+            {
+                pdev->virtfn.pf_pdev = pf_pdev;
+                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
+            }
+            else
+            {
+                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
+                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
+                       &pdev->sbdf);
+                free_pdev(pseg, pdev);
+                ret = -ENODEV;
+                goto out;
+            }
+        }
     }
 
-    if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
+    if ( !pdev->info.is_virtfn && !pdev->physfn.vf_rlen[0] )
     {
         unsigned int pos = pci_find_ext_capability(pdev->sbdf,
                                                    PCI_EXT_CAP_ID_SRIOV);
@@ -715,7 +742,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         {
             unsigned int i;
 
-            BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
+            BUILD_BUG_ON(ARRAY_SIZE(pdev->physfn.vf_rlen) !=
+                                    PCI_SRIOV_NUM_BARS);
+
             for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
@@ -730,7 +759,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                     continue;
                 }
                 ret = pci_size_mem_bar(pdev->sbdf, idx, NULL,
-                                       &pdev->vf_rlen[i],
+                                       &pdev->physfn.vf_rlen[i],
                                        PCI_BAR_VF |
                                        ((i == PCI_SRIOV_NUM_BARS - 1) ?
                                         PCI_BAR_LAST : 0));
@@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
+            ret = 0;
+
+            if ( !pdev->info.is_virtfn )
+            {
+                struct pci_dev *vf_pdev, *tmp;
+
+                /*
+                 * If a PF with VFs enabled is removed, then re-added without
+                 * VFs enabled, the previously associated VFs would no longer be
+                 * backed by a device. Remove the associated VFs right away.
+                 */
+                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
+                                         virtfn.entry)
+                    ret = pci_remove_device(vf_pdev->sbdf.seg,
+                                            vf_pdev->sbdf.bus,
+                                            vf_pdev->sbdf.devfn) ?: ret;
+            }
+
             if ( pdev->domain )
             {
                 write_lock(&pdev->domain->pci_lock);
@@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
                 write_unlock(&pdev->domain->pci_lock);
             }
             pci_cleanup_msi(pdev);
-            ret = iommu_remove_device(pdev);
+            ret = iommu_remove_device(pdev) ?: ret;
             printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
             free_pdev(pseg, pdev);
             break;
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 63e49f0117e9..a24026d25148 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -150,7 +150,18 @@ struct pci_dev {
         unsigned int count;
 #define PT_FAULT_THRESHOLD 10
     } fault;
-    u64 vf_rlen[6];
+    union {
+        struct pf_info {
+            /* Only populated for PFs (info.is_virtfn == false) */
+            struct list_head vf_list;             /* List head */
+            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
+        } physfn;
+        struct {
+            /* Only populated for VFs (info.is_virtfn == true) */
+            struct list_head entry;               /* Entry in vf_list */
+            struct pci_dev *pf_pdev;              /* Link from VF to PF */
+        } virtfn;
+    };
 
     /* Data for vPCI. */
     struct vpci *vpci;

base-commit: b8cdfac2be38c98dd3ad0e911a3f7f787f5bcf6b
-- 
2.46.0
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Jan Beulich 2 months, 3 weeks ago
On 27.08.2024 05:59, Stewart Hildebrand wrote:
> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
> call path wasn't updated to reflect the change, leading to a failed
> assertion observed under the following conditions:
> 
> * PV dom0
> * Debug build (debug=y) of Xen
> * There is an SR-IOV device in the system with one or more VFs enabled
> * Dom0 has loaded the driver for the VF and enabled MSI-X
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> 
> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
> associated PF to access the vf_rlen array. This array is initialized in
> pci_add_device() and is only populated in the associated PF's struct
> pci_dev.
> 
> Add links between the VF's struct pci_dev and associated PF struct
> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
> VFs have been removed. Access the vf_rlen array via the new link to the
> PF, and remove the troublesome call to pci_get_pdev().
> 
> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
> pci_add_device() to set up the link from VF to PF. In case the new
> pci_get_pdev() invocation fails to find the associated PF (returning
> NULL), return an error.
> 
> In pci_remove_device(), handle an issue with a corner case when a PF is
> removed with VFs enabled, then re-added with VFs disabled. During PF
> removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
> dangling VFs. At least Linux warns in this case:
> 
> [  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove
> 
> If the PF is subsequently re-added with VFs disabled, the previously
> associated VFs in Xen would not be backed by a device. Any attempt to
> access the config space of the stale VF would be invalid. Avoid issues
> in this case by removing the VFs right away when removing the PF. This
> also happens to simplify the maintenance of the PF<->VF links since the
> associated PF will always exist during the lifetime of a VF. Note,
> however, if a PF is removed before the VFs, Xen will return an error
> for the VF removals since they were already removed.

Not very nice, but probably tolerable.

> ---
> When I tested removing a PF with VFs enabled, then re-adding with VFs
> disabled, I observed the following Xen crash when dom0 attempted to
> access the config space of a stale VF:
> 
> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
> 
> This crash is avoided by removing the VFs right away with the PF.

If I'm not mistaken the same would happen independent of your change if
Dom0 disabled VFs without reporting their removal (first). That's
technically no different from removing with VFs enabled, then adding
with VFs disabled. We may want to be able to cope with that.

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>  
>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>  
> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);

There is a certain risk with doing such uniformly when the field is part
of a union. Yes, little initialization has happened up to here, but I'm
still concerned. (One of the reasons I don't like the struct list_head
instances to be split, despite your legitimate point regarding naming.)
At the very least this wants moving yet earlier in the function, before
the new struct is passed anywhere else.

> @@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * extended function.
>           */
>          if ( pdev->info.is_virtfn )
> +        {
> +            struct pci_dev *pf_pdev;
> +
>              pdev->info.is_extfn = pf_is_extfn;
> +            pf_pdev = pci_get_pdev(NULL,
> +                                   PCI_SBDF(seg, info->physfn.bus,
> +                                            info->physfn.devfn));
> +            if ( pf_pdev )
> +            {
> +                pdev->virtfn.pf_pdev = pf_pdev;
> +                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
> +            }
> +            else
> +            {
> +                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
> +                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
> +                       &pdev->sbdf);
> +                free_pdev(pseg, pdev);
> +                ret = -ENODEV;

Why -ENODEV? Ideally the error code from the earlier pci_add_device() could
be retained (even if some fallback would be needed in case that one succeeded
or the path wasn't even taken). Yet on the whole I wonder if we wouldn't be
better off delaying that recursive call some, such that the lock wouldn't
be dropped anymore between the call and getting here. This would then also
eliminate the need for the new pci_get_pdev() invocation.

> @@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> +            ret = 0;
> +
> +            if ( !pdev->info.is_virtfn )
> +            {
> +                struct pci_dev *vf_pdev, *tmp;
> +
> +                /*
> +                 * If a PF with VFs enabled is removed, then re-added without
> +                 * VFs enabled, the previously associated VFs would no longer be
> +                 * backed by a device. Remove the associated VFs right away.
> +                 */

While in the description the "enabled" and "disabled" are kind of
tolerable, at least in the code comment I think we want to be more
precise. The question isn't whether VFs are enabled, but whether we
know of the VFs.

It's further unclear here what "a device" is.

> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
> +                                         virtfn.entry)
> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
> +                                            vf_pdev->sbdf.bus,
> +                                            vf_pdev->sbdf.devfn) ?: ret;

And if this fails, the VF will still remain orphaned. I think in the
model I had suggested no such risk would exist.

Misra also isn't going to like the recursion here.

> @@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>                  write_unlock(&pdev->domain->pci_lock);
>              }
>              pci_cleanup_msi(pdev);
> -            ret = iommu_remove_device(pdev);
> +            ret = iommu_remove_device(pdev) ?: ret;
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;

Is it even legitimate to continue cleaning up if there was an earlier
failure?

In any event - please consider e.g. the case where the XSM check allows
the PF removal, but denies the removal of some of the VFs.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -150,7 +150,18 @@ struct pci_dev {
>          unsigned int count;
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
> -    u64 vf_rlen[6];
> +    union {
> +        struct pf_info {
> +            /* Only populated for PFs (info.is_virtfn == false) */
> +            struct list_head vf_list;             /* List head */
> +            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
> +        } physfn;
> +        struct {
> +            /* Only populated for VFs (info.is_virtfn == true) */
> +            struct list_head entry;               /* Entry in vf_list */
> +            struct pci_dev *pf_pdev;              /* Link from VF to PF */

const?

Jan
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Stewart Hildebrand 1 month, 1 week ago
On 8/28/24 06:36, Jan Beulich wrote:
> On 27.08.2024 05:59, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add links between the VF's struct pci_dev and associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have been removed. Access the vf_rlen array via the new link to the
>> PF, and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_get_pdev() invocation fails to find the associated PF (returning
>> NULL), return an error.
>>
>> In pci_remove_device(), handle an issue with a corner case when a PF is
>> removed with VFs enabled, then re-added with VFs disabled. During PF
>> removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
>> dangling VFs. At least Linux warns in this case:
>>
>> [  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove
>>
>> If the PF is subsequently re-added with VFs disabled, the previously
>> associated VFs in Xen would not be backed by a device. Any attempt to
>> access the config space of the stale VF would be invalid. Avoid issues
>> in this case by removing the VFs right away when removing the PF. This
>> also happens to simplify the maintenance of the PF<->VF links since the
>> associated PF will always exist during the lifetime of a VF. Note,
>> however, if a PF is removed before the VFs, Xen will return an error
>> for the VF removals since they were already removed.
> 
> Not very nice, but probably tolerable.
> 
>> ---
>> When I tested removing a PF with VFs enabled, then re-adding with VFs
>> disabled, I observed the following Xen crash when dom0 attempted to
>> access the config space of a stale VF:
>>
>> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
>> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
>> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
>> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
>> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
>> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
>> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
>> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
>> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
>> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
>>
>> This crash is avoided by removing the VFs right away with the PF.
> 
> If I'm not mistaken the same would happen independent of your change if
> Dom0 disabled VFs without reporting their removal (first). That's
> technically no different from removing with VFs enabled, then adding
> with VFs disabled. We may want to be able to cope with that.

You're right, the ASSERT in pci_msi_conf_write_intercept() can be
triggered with a modified dom0 kernel by disabling SR-IOV and performing
a device reset on the stale VF, either manually or as part of
"xl pci-assignable-add".

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  echo 0000\:01\:10.0 > /sys/bus/pci/devices/0000\:01\:10.0/driver/unbind
  echo 1 > /sys/bus/pci/devices/0000\:01\:10.0/reset

If using xl pci-assignable-add, then the ASSERT in pci_reset_msix_state()
can also be triggered.

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  xl pci-assignable-add 01:10.0

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1246
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040346b0a>] R pci_reset_msix_state+0x47/0x50
(XEN)    [<ffff82d040287eec>] F pdev_msix_assign+0x19/0x35
(XEN)    [<ffff82d040286184>] F drivers/passthrough/pci.c#assign_device+0x181/0x471
(XEN)    [<ffff82d040287c36>] F iommu_do_pci_domctl+0x248/0x2ec
(XEN)    [<ffff82d040284e1f>] F iommu_do_domctl+0x26/0x44
(XEN)    [<ffff82d0402483b8>] F do_domctl+0x8c1/0x1660
(XEN)    [<ffff82d04032977e>] F pv_hypercall+0x5ce/0x6af
(XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

These test cases require a Xen with CONFIG_DEBUG=y and hacking the Linux
SR-IOV subsystem to skip the VF removal (not just by hacking a PF
driver). For completeness, this is the modification I made to my dom0
kernel for the test cases.

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b8302b07a885..562d2024636f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -385,7 +385,9 @@ void pci_iov_remove_virtfn(struct pci_dev *dev, int id)
        if (virtfn->dev.kobj.sd)
                sysfs_remove_link(&virtfn->dev.kobj, "physfn");
 
+#if 0
        pci_stop_and_remove_bus_device(virtfn);
+#endif
        virtfn_remove_bus(dev->bus, virtfn->bus);
 
        /* balance pci_get_domain_bus_and_slot() */


Anyway, the approach of removing VFs right away with the PFs isn't
sufficient. I suggest (planning for v5) getting rid of the ASSERTs and
returning an error code from pci_msi_conf_write_intercept() and
pci_reset_msix_state(). We may also want to set pdev->broken = true.

>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>  
>>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>>  
>> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);
> 
> There is a certain risk with doing such uniformly when the field is part
> of a union. Yes, little initialization has happened up to here, but I'm
> still concerned. (One of the reasons I don't like the struct list_head
> instances to be split, despite your legitimate point regarding naming.)
> At the very least this wants moving yet earlier in the function, before
> the new struct is passed anywhere else.

Understood. I personally have a slight preference for keeping the entry
and head names distinct, so I'll plan to move the initialization
earlier. However, I could easily be convinced to un-split the struct
list_head instances if that's your preference. Let me know.

>> @@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
>> +            }
>> +            else
>> +            {
>> +                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
>> +                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
>> +                       &pdev->sbdf);
>> +                free_pdev(pseg, pdev);
>> +                ret = -ENODEV;
> 
> Why -ENODEV? Ideally the error code from the earlier pci_add_device() could
> be retained (even if some fallback would be needed in case that one succeeded
> or the path wasn't even taken).

I'll retain the error code from the pci_add_device() invocation.

> Yet on the whole I wonder if we wouldn't be
> better off delaying that recursive call some, such that the lock wouldn't
> be dropped anymore between the call and getting here.

Agreed, will do.

> This would then also
> eliminate the need for the new pci_get_pdev() invocation.
> 
>> @@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>          if ( pdev->bus == bus && pdev->devfn == devfn )
>>          {
>> +            ret = 0;
>> +
>> +            if ( !pdev->info.is_virtfn )
>> +            {
>> +                struct pci_dev *vf_pdev, *tmp;
>> +
>> +                /*
>> +                 * If a PF with VFs enabled is removed, then re-added without
>> +                 * VFs enabled, the previously associated VFs would no longer be
>> +                 * backed by a device. Remove the associated VFs right away.
>> +                 */
> 
> While in the description the "enabled" and "disabled" are kind of
> tolerable, at least in the code comment I think we want to be more
> precise. The question isn't whether VFs are enabled, but whether we
> know of the VFs.
> 
> It's further unclear here what "a device" is.

The in-code comment shouldn't be relevant any more for v5, so I'll
remove it.

>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
>> +                                         virtfn.entry)
>> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
>> +                                            vf_pdev->sbdf.bus,
>> +                                            vf_pdev->sbdf.devfn) ?: ret;
> 
> And if this fails, the VF will still remain orphaned. I think in the
> model I had suggested no such risk would exist.
> 
> Misra also isn't going to like the recursion here.

With the ASSERTs being addressed directly, there's no need to remove
the VFs right away with the PF.

BTW, I don't think refusing a removal "request" would be a good idea.
Dom0 isn't really requesting the device to be removed. Dom0 has already
removed the device (e.g. in response to hot-unplug or SR-IOV disable),
and is merely informing Xen of the removal.

So during PF removal, I'll plan (for v5) to unlink the the VFs from the
PF, and continue to rely on dom0 to inform Xen of PF and VF removal
individually. By unlink, I mean set vf_pdev->virtfn.pf_pdev = NULL and
remove the VFs from the list. Probably also set vf_pdev->broken = true.

>> @@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>                  write_unlock(&pdev->domain->pci_lock);
>>              }
>>              pci_cleanup_msi(pdev);
>> -            ret = iommu_remove_device(pdev);
>> +            ret = iommu_remove_device(pdev) ?: ret;
>>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>>              free_pdev(pseg, pdev);
>>              break;
> 
> Is it even legitimate to continue cleaning up if there was an earlier
> failure?
> 
> In any event - please consider e.g. the case where the XSM check allows
> the PF removal, but denies the removal of some of the VFs.

This shouldn't be relevant any more for v5.

>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,18 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    union {
>> +        struct pf_info {
>> +            /* Only populated for PFs (info.is_virtfn == false) */
>> +            struct list_head vf_list;             /* List head */
>> +            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        } physfn;
>> +        struct {
>> +            /* Only populated for VFs (info.is_virtfn == true) */
>> +            struct list_head entry;               /* Entry in vf_list */
>> +            struct pci_dev *pf_pdev;              /* Link from VF to PF */
> 
> const?

Yep

> 
> Jan
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Jan Beulich 1 month, 1 week ago
On 09.10.2024 21:44, Stewart Hildebrand wrote:
> On 8/28/24 06:36, Jan Beulich wrote:
>> On 27.08.2024 05:59, Stewart Hildebrand wrote:
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>>  
>>>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>>>  
>>> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);
>>
>> There is a certain risk with doing such uniformly when the field is part
>> of a union. Yes, little initialization has happened up to here, but I'm
>> still concerned. (One of the reasons I don't like the struct list_head
>> instances to be split, despite your legitimate point regarding naming.)
>> At the very least this wants moving yet earlier in the function, before
>> the new struct is passed anywhere else.
> 
> Understood. I personally have a slight preference for keeping the entry
> and head names distinct, so I'll plan to move the initialization
> earlier. However, I could easily be convinced to un-split the struct
> list_head instances if that's your preference. Let me know.

I indicated before that this would be my preference, without meaning to
insist on this folding.

>>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
>>> +                                         virtfn.entry)
>>> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
>>> +                                            vf_pdev->sbdf.bus,
>>> +                                            vf_pdev->sbdf.devfn) ?: ret;
>>
>> And if this fails, the VF will still remain orphaned. I think in the
>> model I had suggested no such risk would exist.
>>
>> Misra also isn't going to like the recursion here.
> 
> With the ASSERTs being addressed directly, there's no need to remove
> the VFs right away with the PF.
> 
> BTW, I don't think refusing a removal "request" would be a good idea.
> Dom0 isn't really requesting the device to be removed. Dom0 has already
> removed the device (e.g. in response to hot-unplug or SR-IOV disable),
> and is merely informing Xen of the removal.

Just to clarify: I don't mean returning an error here to indicate
"refusal". As you say, this is merely a notification. Yet I think it's
still legitimate to pass back an error. Whether the Dom0 kernel can
do anything useful with that error is a separate question.

> So during PF removal, I'll plan (for v5) to unlink the the VFs from the
> PF, and continue to rely on dom0 to inform Xen of PF and VF removal
> individually. By unlink, I mean set vf_pdev->virtfn.pf_pdev = NULL and
> remove the VFs from the list. Probably also set vf_pdev->broken = true.

As to the latter - yes. For the rest I guess I need to see the new code.

Jan
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Stewart Hildebrand 2 months ago
On 8/28/24 06:36, Jan Beulich wrote:
> On 27.08.2024 05:59, Stewart Hildebrand wrote:
>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
>> +                                         virtfn.entry)
>> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
>> +                                            vf_pdev->sbdf.bus,
>> +                                            vf_pdev->sbdf.devfn) ?: ret;
> 
> And if this fails, the VF will still remain orphaned. I think in the
> model I had suggested no such risk would exist.

Are you referring to your suggestion "to refuse the request (in
pci_remove_device()) when the list isn't empty" ? [1] Or something else?

[1] https://lore.kernel.org/xen-devel/74f88a45-a632-4ca6-9cee-95f52370b397@suse.com/
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Jan Beulich 1 month, 4 weeks ago
On 18.09.2024 16:11, Stewart Hildebrand wrote:
> On 8/28/24 06:36, Jan Beulich wrote:
>> On 27.08.2024 05:59, Stewart Hildebrand wrote:
>>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
>>> +                                         virtfn.entry)
>>> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
>>> +                                            vf_pdev->sbdf.bus,
>>> +                                            vf_pdev->sbdf.devfn) ?: ret;
>>
>> And if this fails, the VF will still remain orphaned. I think in the
>> model I had suggested no such risk would exist.
> 
> Are you referring to your suggestion "to refuse the request (in
> pci_remove_device()) when the list isn't empty" ? [1] Or something else?
> 
> [1] https://lore.kernel.org/xen-devel/74f88a45-a632-4ca6-9cee-95f52370b397@suse.com/

It's been a while, but yes, I think that's the one.

Jan
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Roger Pau Monné 2 months, 3 weeks ago
On Mon, Aug 26, 2024 at 11:59:28PM -0400, Stewart Hildebrand wrote:
> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
                    ^ was?
> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
> call path wasn't updated to reflect the change, leading to a failed
> assertion observed under the following conditions:
> 
> * PV dom0
> * Debug build (debug=y) of Xen
> * There is an SR-IOV device in the system with one or more VFs enabled
> * Dom0 has loaded the driver for the VF and enabled MSI-X
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> 
> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
> associated PF to access the vf_rlen array. This array is initialized in
> pci_add_device() and is only populated in the associated PF's struct
> pci_dev.
> 
> Add links between the VF's struct pci_dev and associated PF struct
> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
> VFs have been removed. Access the vf_rlen array via the new link to the
> PF, and remove the troublesome call to pci_get_pdev().
> 
> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
> pci_add_device() to set up the link from VF to PF. In case the new
> pci_get_pdev() invocation fails to find the associated PF (returning
> NULL), return an error.
> 
> In pci_remove_device(), handle an issue with a corner case when a PF is
> removed with VFs enabled, then re-added with VFs disabled. During PF
> removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
> dangling VFs. At least Linux warns in this case:
> 
> [  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove
> 
> If the PF is subsequently re-added with VFs disabled, the previously
> associated VFs in Xen would not be backed by a device. Any attempt to
> access the config space of the stale VF would be invalid. Avoid issues
> in this case by removing the VFs right away when removing the PF. This
> also happens to simplify the maintenance of the PF<->VF links since the
> associated PF will always exist during the lifetime of a VF. Note,
> however, if a PF is removed before the VFs, Xen will return an error
> for the VF removals since they were already removed.

I think there are two different fixes in this patch, that should be
split into two separate fixes:

 - Fix for the locking issue, which requires adding a pointer to the
   parent PF in the pci_dev of VFs.
 - Introduce better tracking of the PF<->VF pci_dev relation.

My suggestion would be to introduce the tacking between PF and VFs as
a pre-patch (IOW: the linked list related changes), and then do the
addition of the pf_pdev (the proper fix) as a followup.

> 
> Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")
> Reported-by: Teddy Astie <teddy.astie@vates.tech>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> Candidate for backport to 4.19
> 
> v3->v4:
> * handle case when PF is removed with VFs enabled, then re-added with
>   VFs disabled
> 
> v2->v3:
> * link from VF to PF's struct pci_dev *
> 
> v1->v2:
> * remove call to pci_get_pdev()
> ---
> When I tested removing a PF with VFs enabled, then re-adding with VFs
> disabled, I observed the following Xen crash when dom0 attempted to
> access the config space of a stale VF:
> 
> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
> 
> This crash is avoided by removing the VFs right away with the PF.

It's not clear to me, does this crash happen even without this change
applied?  If that's the case it seems like you need an extra Fixes
tag (see above for my recommendation to split the change into two).

> ---
>  xen/arch/x86/msi.c            | 33 +++++++++++----------
>  xen/drivers/passthrough/pci.c | 55 ++++++++++++++++++++++++++++++++---
>  xen/include/xen/pci.h         | 13 ++++++++-
>  3 files changed, 80 insertions(+), 21 deletions(-)
> 
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 0c97fbb3fc03..8d54f268dbbf 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -662,34 +662,31 @@ static int msi_capability_init(struct pci_dev *dev,
>      return 0;
>  }
>  
> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
> +static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf,

Since you already have to touch this, I don't think it would very
complicated to pass a pci_sbdf_t here instead of the seg, bus, slot
and func parameters.  You can expand back to seg, bus, slot and func
as local function parameters in order to avoid more code changes than
strictly needed.

> +                            const struct pf_info *pf_info)
>  {
> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>      u8 limit;
>      u32 addr, base = PCI_BASE_ADDRESS_0;
>      u64 disp = 0;
>  
>      if ( vf >= 0 )
>      {
> -        struct pci_dev *pdev = pci_get_pdev(NULL,
> -                                            PCI_SBDF(seg, bus, slot, func));
>          unsigned int pos;
>          uint16_t ctrl, num_vf, offset, stride;
>  
> -        if ( !pdev )
> -            return 0;

Is it worth adding a:

if ( !pf_info )
{
    ASSERT_UNREACHABLE();
    return 0;
}

As I think pf_info must be != NULL if vf >= 0?

> -
> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>  
>          if ( !pos ||
>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>               bir >= PCI_SRIOV_NUM_BARS ||
> -             !pdev->vf_rlen[bir] )
> +             !pf_info->vf_rlen[bir] )
>              return 0;
>          base = pos + PCI_SRIOV_BAR;
>          vf -= PCI_BDF(bus, slot, func) + offset;
> @@ -703,8 +700,8 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>          }
>          if ( vf >= num_vf )
>              return 0;
> -        BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
> -        disp = vf * pdev->vf_rlen[bir];
> +        BUILD_BUG_ON(ARRAY_SIZE(pf_info->vf_rlen) != PCI_SRIOV_NUM_BARS);
> +        disp = vf * pf_info->vf_rlen[bir];
>          limit = PCI_SRIOV_NUM_BARS;
>      }
>      else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
> @@ -813,6 +810,7 @@ static int msix_capability_init(struct pci_dev *dev,
>          int vf;
>          paddr_t pba_paddr;
>          unsigned int pba_offset;
> +        const struct pf_info *pf_info;
>  
>          if ( !dev->info.is_virtfn )
>          {
> @@ -820,6 +818,7 @@ static int msix_capability_init(struct pci_dev *dev,
>              pslot = slot;
>              pfunc = func;
>              vf = -1;
> +            pf_info = NULL;
>          }
>          else
>          {
> @@ -827,9 +826,11 @@ static int msix_capability_init(struct pci_dev *dev,
>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>              vf = dev->sbdf.bdf;
> +            pf_info = &dev->virtfn.pf_pdev->physfn;
>          }
>  
> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
> +        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf,
> +                                       pf_info);
>          WARN_ON(msi && msi->table_base != table_paddr);
>          if ( !table_paddr )
>          {
> @@ -852,7 +853,7 @@ static int msix_capability_init(struct pci_dev *dev,
>  
>          pba_offset = pci_conf_read32(dev->sbdf, msix_pba_offset_reg(pos));
>          bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
> -        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
> +        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf, pf_info);
>          WARN_ON(!pba_paddr);
>          pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 5a446d3dcee0..97b3417252f7 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>  
>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>  
> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);
> +
>      /* update bus2bridge */
>      switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
>      {
> @@ -446,6 +448,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>  
>      list_del(&pdev->alldevs_list);
>      pdev_msi_deinit(pdev);
> +
> +    if ( pdev->info.is_virtfn )
> +        list_del(&pdev->virtfn.entry);
> +
>      xfree(pdev);
>  }
>  
> @@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * extended function.
>           */
>          if ( pdev->info.is_virtfn )
> +        {
> +            struct pci_dev *pf_pdev;
> +
>              pdev->info.is_extfn = pf_is_extfn;
> +            pf_pdev = pci_get_pdev(NULL,
> +                                   PCI_SBDF(seg, info->physfn.bus,
> +                                            info->physfn.devfn));
> +            if ( pf_pdev )
> +            {
> +                pdev->virtfn.pf_pdev = pf_pdev;
> +                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
> +            }
> +            else
> +            {
> +                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
> +                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
> +                       &pdev->sbdf);
> +                free_pdev(pseg, pdev);
> +                ret = -ENODEV;
> +                goto out;
> +            }
> +        }
>      }
>  
> -    if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
> +    if ( !pdev->info.is_virtfn && !pdev->physfn.vf_rlen[0] )
>      {
>          unsigned int pos = pci_find_ext_capability(pdev->sbdf,
>                                                     PCI_EXT_CAP_ID_SRIOV);
> @@ -715,7 +742,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          {
>              unsigned int i;
>  
> -            BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
> +            BUILD_BUG_ON(ARRAY_SIZE(pdev->physfn.vf_rlen) !=
> +                                    PCI_SRIOV_NUM_BARS);
> +
>              for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
>              {
>                  unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
> @@ -730,7 +759,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>                      continue;
>                  }
>                  ret = pci_size_mem_bar(pdev->sbdf, idx, NULL,
> -                                       &pdev->vf_rlen[i],
> +                                       &pdev->physfn.vf_rlen[i],
>                                         PCI_BAR_VF |
>                                         ((i == PCI_SRIOV_NUM_BARS - 1) ?
>                                          PCI_BAR_LAST : 0));
> @@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> +            ret = 0;
> +
> +            if ( !pdev->info.is_virtfn )
> +            {
> +                struct pci_dev *vf_pdev, *tmp;
> +
> +                /*
> +                 * If a PF with VFs enabled is removed, then re-added without
> +                 * VFs enabled, the previously associated VFs would no longer be
> +                 * backed by a device. Remove the associated VFs right away.
> +                 */
> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
> +                                         virtfn.entry)
> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
> +                                            vf_pdev->sbdf.bus,
> +                                            vf_pdev->sbdf.devfn) ?: ret;
> +            }
> +
>              if ( pdev->domain )
>              {
>                  write_lock(&pdev->domain->pci_lock);
> @@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>                  write_unlock(&pdev->domain->pci_lock);
>              }
>              pci_cleanup_msi(pdev);
> -            ret = iommu_remove_device(pdev);
> +            ret = iommu_remove_device(pdev) ?: ret;
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 63e49f0117e9..a24026d25148 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -150,7 +150,18 @@ struct pci_dev {
>          unsigned int count;
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
> -    u64 vf_rlen[6];
> +    union {
> +        struct pf_info {
> +            /* Only populated for PFs (info.is_virtfn == false) */
> +            struct list_head vf_list;             /* List head */
> +            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
> +        } physfn;
> +        struct {
> +            /* Only populated for VFs (info.is_virtfn == true) */
> +            struct list_head entry;               /* Entry in vf_list */
> +            struct pci_dev *pf_pdev;              /* Link from VF to PF */

With this pointer being introduced for VFs, is the information in
pci_dev_info still required?  It would seem to me the physfn.bus and
physfn.devfn fields can now be fetched directly from the pf_pdev
pointer?

Thanks, Roger.
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices
Posted by Stewart Hildebrand 1 month, 1 week ago
On 8/27/24 09:20, Roger Pau Monné wrote:
> On Mon, Aug 26, 2024 at 11:59:28PM -0400, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>                     ^ was?

Yep

>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add links between the VF's struct pci_dev and associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have been removed. Access the vf_rlen array via the new link to the
>> PF, and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_get_pdev() invocation fails to find the associated PF (returning
>> NULL), return an error.
>>
>> In pci_remove_device(), handle an issue with a corner case when a PF is
>> removed with VFs enabled, then re-added with VFs disabled. During PF
>> removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
>> dangling VFs. At least Linux warns in this case:
>>
>> [  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove
>>
>> If the PF is subsequently re-added with VFs disabled, the previously
>> associated VFs in Xen would not be backed by a device. Any attempt to
>> access the config space of the stale VF would be invalid. Avoid issues
>> in this case by removing the VFs right away when removing the PF. This
>> also happens to simplify the maintenance of the PF<->VF links since the
>> associated PF will always exist during the lifetime of a VF. Note,
>> however, if a PF is removed before the VFs, Xen will return an error
>> for the VF removals since they were already removed.
> 
> I think there are two different fixes in this patch, that should be
> split into two separate fixes:
> 
>  - Fix for the locking issue, which requires adding a pointer to the
>    parent PF in the pci_dev of VFs.
>  - Introduce better tracking of the PF<->VF pci_dev relation.
> 
> My suggestion would be to introduce the tacking between PF and VFs as
> a pre-patch (IOW: the linked list related changes), and then do the
> addition of the pf_pdev (the proper fix) as a followup.

Agreed

>>
>> Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")
>> Reported-by: Teddy Astie <teddy.astie@vates.tech>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> Candidate for backport to 4.19
>>
>> v3->v4:
>> * handle case when PF is removed with VFs enabled, then re-added with
>>   VFs disabled
>>
>> v2->v3:
>> * link from VF to PF's struct pci_dev *
>>
>> v1->v2:
>> * remove call to pci_get_pdev()
>> ---
>> When I tested removing a PF with VFs enabled, then re-adding with VFs
>> disabled, I observed the following Xen crash when dom0 attempted to
>> access the config space of a stale VF:
>>
>> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
>> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
>> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
>> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
>> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
>> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
>> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
>> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
>> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
>> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
>>
>> This crash is avoided by removing the VFs right away with the PF.
> 
> It's not clear to me, does this crash happen even without this change
> applied?

Yep, indeed it does

> If that's the case it seems like you need an extra Fixes
> tag (see above for my recommendation to split the change into two).

Yep

I'm taking a slightly different approach to fixing the
pci_msi_conf_write_intercept() crash, so I'll plan on having 3 patches
in v5:

1. Remove ASSERTs in pci_msi_conf_write_intercept() and pci_reset_msix_state()
    Fixes: 484d7c852e4f ("x86/MSI-X: track host and guest mask-all requests separately")
    Fixes: 575e18d54d19 ("pci: clear {host/guest}_maskall field on assign")
2. Introduce PF<->VF links
    No Fixes tag, but pre-requisite for the next patch
3. Fix the locking issue
    Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")

>> ---
>>  xen/arch/x86/msi.c            | 33 +++++++++++----------
>>  xen/drivers/passthrough/pci.c | 55 ++++++++++++++++++++++++++++++++---
>>  xen/include/xen/pci.h         | 13 ++++++++-
>>  3 files changed, 80 insertions(+), 21 deletions(-)
>>
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 0c97fbb3fc03..8d54f268dbbf 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,31 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>  
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>> +static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf,
> 
> Since you already have to touch this, I don't think it would very
> complicated to pass a pci_sbdf_t here instead of the seg, bus, slot
> and func parameters.  You can expand back to seg, bus, slot and func
> as local function parameters in order to avoid more code changes than
> strictly needed.

Sure thing. May as well change the return type and bir to stdint.h types
too.

>> +                            const struct pf_info *pf_info)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>  
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>  
>> -        if ( !pdev )
>> -            return 0;
> 
> Is it worth adding a:
> 
> if ( !pf_info )
> {
>     ASSERT_UNREACHABLE();
>     return 0;
> }
> 
> As I think pf_info must be != NULL if vf >= 0?

We will need a "if ( !pf_info )" check in v5. In v4, it would be
ASSERT_UNREACHABLE(), but it won't be anymore in the next rev.

>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>  
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info->vf_rlen[bir] )
>>              return 0;
>>          base = pos + PCI_SRIOV_BAR;
>>          vf -= PCI_BDF(bus, slot, func) + offset;
>> @@ -703,8 +700,8 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>>          }
>>          if ( vf >= num_vf )
>>              return 0;
>> -        BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
>> -        disp = vf * pdev->vf_rlen[bir];
>> +        BUILD_BUG_ON(ARRAY_SIZE(pf_info->vf_rlen) != PCI_SRIOV_NUM_BARS);
>> +        disp = vf * pf_info->vf_rlen[bir];
>>          limit = PCI_SRIOV_NUM_BARS;
>>      }
>>      else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
>> @@ -813,6 +810,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>          int vf;
>>          paddr_t pba_paddr;
>>          unsigned int pba_offset;
>> +        const struct pf_info *pf_info;
>>  
>>          if ( !dev->info.is_virtfn )
>>          {
>> @@ -820,6 +818,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = slot;
>>              pfunc = func;
>>              vf = -1;
>> +            pf_info = NULL;
>>          }
>>          else
>>          {
>> @@ -827,9 +826,11 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>  
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf,
>> +                                       pf_info);
>>          WARN_ON(msi && msi->table_base != table_paddr);
>>          if ( !table_paddr )
>>          {
>> @@ -852,7 +853,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>  
>>          pba_offset = pci_conf_read32(dev->sbdf, msix_pba_offset_reg(pos));
>>          bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
>> -        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf, pf_info);
>>          WARN_ON(!pba_paddr);
>>          pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
>>  
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 5a446d3dcee0..97b3417252f7 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>  
>>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>>  
>> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);
>> +
>>      /* update bus2bridge */
>>      switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
>>      {
>> @@ -446,6 +448,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +        list_del(&pdev->virtfn.entry);
>> +
>>      xfree(pdev);
>>  }
>>  
>> @@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
>> +            }
>> +            else
>> +            {
>> +                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
>> +                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
>> +                       &pdev->sbdf);
>> +                free_pdev(pseg, pdev);
>> +                ret = -ENODEV;
>> +                goto out;
>> +            }
>> +        }
>>      }
>>  
>> -    if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
>> +    if ( !pdev->info.is_virtfn && !pdev->physfn.vf_rlen[0] )
>>      {
>>          unsigned int pos = pci_find_ext_capability(pdev->sbdf,
>>                                                     PCI_EXT_CAP_ID_SRIOV);
>> @@ -715,7 +742,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>          {
>>              unsigned int i;
>>  
>> -            BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
>> +            BUILD_BUG_ON(ARRAY_SIZE(pdev->physfn.vf_rlen) !=
>> +                                    PCI_SRIOV_NUM_BARS);
>> +
>>              for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
>>              {
>>                  unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
>> @@ -730,7 +759,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>                      continue;
>>                  }
>>                  ret = pci_size_mem_bar(pdev->sbdf, idx, NULL,
>> -                                       &pdev->vf_rlen[i],
>> +                                       &pdev->physfn.vf_rlen[i],
>>                                         PCI_BAR_VF |
>>                                         ((i == PCI_SRIOV_NUM_BARS - 1) ?
>>                                          PCI_BAR_LAST : 0));
>> @@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>          if ( pdev->bus == bus && pdev->devfn == devfn )
>>          {
>> +            ret = 0;
>> +
>> +            if ( !pdev->info.is_virtfn )
>> +            {
>> +                struct pci_dev *vf_pdev, *tmp;
>> +
>> +                /*
>> +                 * If a PF with VFs enabled is removed, then re-added without
>> +                 * VFs enabled, the previously associated VFs would no longer be
>> +                 * backed by a device. Remove the associated VFs right away.
>> +                 */
>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
>> +                                         virtfn.entry)
>> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
>> +                                            vf_pdev->sbdf.bus,
>> +                                            vf_pdev->sbdf.devfn) ?: ret;
>> +            }
>> +
>>              if ( pdev->domain )
>>              {
>>                  write_lock(&pdev->domain->pci_lock);
>> @@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>                  write_unlock(&pdev->domain->pci_lock);
>>              }
>>              pci_cleanup_msi(pdev);
>> -            ret = iommu_remove_device(pdev);
>> +            ret = iommu_remove_device(pdev) ?: ret;
>>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>>              free_pdev(pseg, pdev);
>>              break;
>> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
>> index 63e49f0117e9..a24026d25148 100644
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,18 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    union {
>> +        struct pf_info {
>> +            /* Only populated for PFs (info.is_virtfn == false) */
>> +            struct list_head vf_list;             /* List head */
>> +            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        } physfn;
>> +        struct {
>> +            /* Only populated for VFs (info.is_virtfn == true) */
>> +            struct list_head entry;               /* Entry in vf_list */
>> +            struct pci_dev *pf_pdev;              /* Link from VF to PF */
> 
> With this pointer being introduced for VFs, is the information in
> pci_dev_info still required?  It would seem to me the physfn.bus and
> physfn.devfn fields can now be fetched directly from the pf_pdev
> pointer?

Hm, well, in v5, the pf_pdev pointer may be NULL. So every time we'd
need to access that info, we'd need an additional NULL check.

> 
> Thanks, Roger.