[v3] vpci/msix: handle accesses adjacent to the MSI-X table

[PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monne 2 years, 10 months ago

The handling of the MSI-X table accesses by Xen requires that any
pages part of the MSI-X related tables are not mapped into the domain
physmap.  As a result, any device registers in the same pages as the
start or the end of the MSIX or PBA tables is not currently
accessible, as the accesses are just dropped.

Note the spec forbids such placing of registers, as the MSIX and PBA
tables must be 4K isolated from any other registers:

"If a Base Address register that maps address space for the MSI-X
Table or MSI-X PBA also maps other usable address space that is not
associated with MSI-X structures, locations (e.g., for CSRs) used in
the other address space must not share any naturally aligned 4-KB
address range with one where either MSI-X structure resides."

Yet the 'Intel Wi-Fi 6 AX201' device on one of my boxes has registers
in the same page as the MSIX tables, and thus won't work on a PVH dom0
without this fix.

In order to cope with the behavior passthrough any accesses that fall
on the same page as the MSIX tables (but don't fall in between) to the
underlying hardware.  Such forwarding also takes care of the PBA
accesses, so it allows to remove the code doing this handling in
msix_{read,write}.  Note that as a result accesses to the PBA array
are no longer limited to 4 and 8 byte sizes, there's no access size
restriction for PBA accesses documented in the specification.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
 - Slightly adjust VMSIX_ADDR_SAME_PAGE().
 - Use IS_ALIGNED and unlikely for the non-aligned access checking.
 - Move the check for the page mapped before the aligned one.
 - Remove cast of data to uint8_t and instead use a mask in order to
   avoid undefined behaviour when shifting.
 - Remove Xen maps of the MSIX related regions when memory decoding
   for the device is enabled by dom0, in order to purge stale maps.

Changes since v1:
 - Properly handle the PBA also.
 - Merge the handlers for adjacent writes into the existing MSIX table
   ones.
---
 xen/drivers/vpci/msix.c | 347 ++++++++++++++++++++++++++++++----------
 xen/drivers/vpci/vpci.c |   7 +-
 xen/include/xen/vpci.h  |   8 +-
 3 files changed, 274 insertions(+), 88 deletions(-)

diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index bea0cc7aed..bcd61fcd8b 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -27,6 +27,11 @@
     ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
      (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
 
+#define VMSIX_ADDR_SAME_PAGE(addr, vpci, nr)                              \
+    (PFN_DOWN(addr) >= PFN_DOWN(vmsix_table_addr(vpci, nr)) &&            \
+     PFN_DOWN(addr) <= PFN_DOWN(vmsix_table_addr(vpci, nr) +              \
+                                vmsix_table_size(vpci, nr) - 1))
+
 static uint32_t cf_check control_read(
     const struct pci_dev *pdev, unsigned int reg, void *data)
 {
@@ -149,7 +154,7 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
         for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
             if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
-                 VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
+                 VMSIX_ADDR_SAME_PAGE(addr, msix->pdev->vpci, i) )
                 return msix;
     }
 
@@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
     return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
 }
 
-static void __iomem *get_pba(struct vpci *vpci)
+static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
 {
     struct vpci_msix *msix = vpci->msix;
     /*
-     * PBA will only be unmapped when the device is deassigned, so access it
-     * without holding the vpci lock.
+     * Regions will only be unmapped when the device is deassigned, so access
+     * them without holding the vpci lock.
      */
-    void __iomem *pba = read_atomic(&msix->pba);
+    void __iomem *table = read_atomic(&msix->table[slot]);
+    paddr_t addr = 0;
+
+    if ( likely(table) )
+        return table;
+
+    switch ( slot )
+    {
+    case VPCI_MSIX_TBL_TAIL:
+        addr = vmsix_table_size(vpci, VPCI_MSIX_TABLE);
+        fallthrough;
+    case VPCI_MSIX_TBL_HEAD:
+        addr += vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
+        break;
 
-    if ( likely(pba) )
-        return pba;
+    case VPCI_MSIX_PBA_TAIL:
+        addr = vmsix_table_size(vpci, VPCI_MSIX_PBA);
+        fallthrough;
+    case VPCI_MSIX_PBA_HEAD:
+        addr += vmsix_table_addr(vpci, VPCI_MSIX_PBA);
+        break;
+    }
 
-    pba = ioremap(vmsix_table_addr(vpci, VPCI_MSIX_PBA),
-                  vmsix_table_size(vpci, VPCI_MSIX_PBA));
-    if ( !pba )
-        return read_atomic(&msix->pba);
+    table = ioremap(round_pgdown(addr), PAGE_SIZE);
+    if ( !table )
+        return read_atomic(&msix->table[slot]);
 
     spin_lock(&vpci->lock);
-    if ( !msix->pba )
+    if ( !msix->table[slot] )
     {
-        write_atomic(&msix->pba, pba);
+        write_atomic(&msix->table[slot], table);
         spin_unlock(&vpci->lock);
     }
     else
     {
         spin_unlock(&vpci->lock);
-        iounmap(pba);
+        iounmap(table);
     }
 
-    return read_atomic(&msix->pba);
+    return read_atomic(&msix->table[slot]);
 }
 
-static int cf_check msix_read(
-    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
+unsigned int get_slot(const struct vpci *vpci, unsigned long addr)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
-    const struct vpci_msix_entry *entry;
-    unsigned int offset;
+    unsigned long pfn = PFN_DOWN(addr);
+
+    /*
+     * The logic below relies on having the tables identity mapped to the guest
+     * address space, or for the `addr` parameter to be translated into its
+     * host physical memory address equivalent.
+     */
+
+    if ( pfn == PFN_DOWN(vmsix_table_addr(vpci, VPCI_MSIX_TABLE)) )
+        return VPCI_MSIX_TBL_HEAD;
+    if ( pfn == PFN_DOWN(vmsix_table_addr(vpci, VPCI_MSIX_TABLE) +
+                         vmsix_table_size(vpci, VPCI_MSIX_TABLE) - 1) )
+        return VPCI_MSIX_TBL_TAIL;
+    if ( pfn == PFN_DOWN(vmsix_table_addr(vpci, VPCI_MSIX_PBA)) )
+        return VPCI_MSIX_PBA_HEAD;
+    if ( pfn == PFN_DOWN(vmsix_table_addr(vpci, VPCI_MSIX_PBA) +
+                         vmsix_table_size(vpci, VPCI_MSIX_PBA) - 1) )
+        return VPCI_MSIX_PBA_TAIL;
+
+    ASSERT_UNREACHABLE();
+    return -1;
+}
+
+static bool adjacent_handle(const struct vpci_msix *msix, unsigned long addr)
+{
+    unsigned int i;
+
+    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
+        return true;
+
+    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
+        return false;
+
+    for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
+        if ( VMSIX_ADDR_SAME_PAGE(addr, msix->pdev->vpci, i) )
+            return true;
+
+    return false;
+}
+
+static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
+                         unsigned long addr, unsigned int len,
+                         unsigned long *data)
+{
+    const void __iomem *mem;
+    struct vpci *vpci = msix->pdev->vpci;
+    unsigned int slot;
 
     *data = ~0ul;
 
-    if ( !msix )
-        return X86EMUL_RETRY;
+    if ( !adjacent_handle(msix, addr + len - 1) )
+        return X86EMUL_OKAY;
 
-    if ( !access_allowed(msix->pdev, addr, len) )
+    slot = get_slot(vpci, addr);
+    if ( slot >= ARRAY_SIZE(msix->table) )
         return X86EMUL_OKAY;
 
-    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
+    mem = get_table(vpci, slot);
+    if ( !mem )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: unable to map MSI-X page, returning all bits set\n",
+                &msix->pdev->sbdf);
+        return X86EMUL_OKAY;
+    }
+
+    if ( unlikely(!IS_ALIGNED(addr, len)) )
     {
-        struct vpci *vpci = msix->pdev->vpci;
-        unsigned int idx = addr - vmsix_table_addr(vpci, VPCI_MSIX_PBA);
-        const void __iomem *pba = get_pba(vpci);
+        unsigned int i;
+
+        gprintk(XENLOG_DEBUG, "%pp: unaligned read to MSI-X related page\n",
+                &msix->pdev->sbdf);
 
         /*
-         * Access to PBA.
+         * Split unaligned accesses into byte sized ones. Shouldn't happen in
+         * the first place, but devices shouldn't have registers in the same 4K
+         * page as the MSIX tables either.
          *
-         * TODO: note that this relies on having the PBA identity mapped to the
-         * guest address space. If this changes the address will need to be
-         * translated.
+         * It's unclear whether this could cause issues if a guest expects a
+         * registers to be accessed atomically, it better use an aligned access
+         * if it has such expectations.
          */
-        if ( !pba )
-        {
-            gprintk(XENLOG_WARNING,
-                    "%pp: unable to map MSI-X PBA, report all pending\n",
-                    &msix->pdev->sbdf);
-            return X86EMUL_OKAY;
-        }
 
-        switch ( len )
+        for ( i = 0; i < len; i++ )
         {
-        case 4:
-            *data = readl(pba + idx);
-            break;
+            unsigned long partial = ~0ul;
+            int rc = adjacent_read(d, msix, addr + i, 1, &partial);
 
-        case 8:
-            *data = readq(pba + idx);
-            break;
+            if ( rc != X86EMUL_OKAY )
+                return rc;
 
-        default:
-            ASSERT_UNREACHABLE();
-            break;
+            *data &= ~(0xfful << (i * 8));
+            *data |= (partial & 0xff) << (i * 8);
         }
 
         return X86EMUL_OKAY;
     }
 
+    switch ( len )
+    {
+    case 1:
+        *data = readb(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 2:
+        *data = readw(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 4:
+        *data = readl(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 8:
+        *data = readq(mem + PAGE_OFFSET(addr));
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static int cf_check msix_read(
+    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
+{
+    const struct domain *d = v->domain;
+    struct vpci_msix *msix = msix_find(d, addr);
+    const struct vpci_msix_entry *entry;
+    struct vpci *vpci;
+    unsigned int offset;
+
+    *data = ~0ul;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    if ( adjacent_handle(msix, addr) )
+        return adjacent_read(d, msix, addr, len, data);
+
+    vpci = msix->pdev->vpci;
+    if ( !access_allowed(msix->pdev, addr, len) )
+        return X86EMUL_OKAY;
+
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
@@ -303,57 +416,103 @@ static int cf_check msix_read(
     return X86EMUL_OKAY;
 }
 
-static int cf_check msix_write(
-    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
+static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
+                          unsigned long addr, unsigned int len,
+                          unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
-    struct vpci_msix_entry *entry;
-    unsigned int offset;
+    void __iomem *mem;
+    struct vpci *vpci = msix->pdev->vpci;
+    unsigned int slot;
 
-    if ( !msix )
-        return X86EMUL_RETRY;
+    if ( !adjacent_handle(msix, addr + len - 1) )
+        return X86EMUL_OKAY;
 
-    if ( !access_allowed(msix->pdev, addr, len) )
+    /*
+     * Only check start and end of the access because the size of the PBA is
+     * assumed to be equal or bigger (8 bytes) than the length of any access
+     * handled here.
+     */
+    if ( (VMSIX_ADDR_IN_RANGE(addr, vpci, VPCI_MSIX_PBA) ||
+          VMSIX_ADDR_IN_RANGE(addr + len - 1, vpci, VPCI_MSIX_PBA)) &&
+         !is_hardware_domain(d) )
+        /* Ignore writes to PBA for DomUs, it's undefined behavior. */
         return X86EMUL_OKAY;
 
-    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
+    slot = get_slot(vpci, addr);
+    if ( slot >= ARRAY_SIZE(msix->table) )
+        return X86EMUL_OKAY;
+
+    mem = get_table(vpci, slot);
+    if ( !mem )
     {
-        struct vpci *vpci = msix->pdev->vpci;
-        unsigned int idx = addr - vmsix_table_addr(vpci, VPCI_MSIX_PBA);
-        const void __iomem *pba = get_pba(vpci);
+        gprintk(XENLOG_WARNING,
+                "%pp: unable to map MSI-X page, dropping write\n",
+                &msix->pdev->sbdf);
+        return X86EMUL_OKAY;
+    }
 
-        if ( !is_hardware_domain(d) )
-            /* Ignore writes to PBA for DomUs, it's behavior is undefined. */
-            return X86EMUL_OKAY;
+    if ( unlikely(!IS_ALIGNED(addr, len)) )
+    {
+        unsigned int i;
 
-        if ( !pba )
-        {
-            /* Unable to map the PBA, ignore write. */
-            gprintk(XENLOG_WARNING,
-                    "%pp: unable to map MSI-X PBA, write ignored\n",
-                    &msix->pdev->sbdf);
-            return X86EMUL_OKAY;
-        }
+        gprintk(XENLOG_DEBUG, "%pp: unaligned write to MSI-X related page\n",
+                &msix->pdev->sbdf);
 
-        switch ( len )
+        for ( i = 0; i < len; i++ )
         {
-        case 4:
-            writel(data, pba + idx);
-            break;
+            int rc = adjacent_write(d, msix, addr + i, 1, data >> (i * 8));
 
-        case 8:
-            writeq(data, pba + idx);
-            break;
-
-        default:
-            ASSERT_UNREACHABLE();
-            break;
+            if ( rc != X86EMUL_OKAY )
+                return rc;
         }
 
         return X86EMUL_OKAY;
     }
 
+    switch ( len )
+    {
+    case 1:
+        writeb(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 2:
+        writew(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 4:
+        writel(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 8:
+        writeq(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static int cf_check msix_write(
+    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
+{
+    const struct domain *d = v->domain;
+    struct vpci_msix *msix = msix_find(d, addr);
+    struct vpci_msix_entry *entry;
+    struct vpci *vpci;
+    unsigned int offset;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    if ( adjacent_handle(msix, addr) )
+        return adjacent_write(d, msix, addr, len, data);
+
+    vpci = msix->pdev->vpci;
+    if ( !access_allowed(msix->pdev, addr, len) )
+        return X86EMUL_OKAY;
+
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
@@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
         }
     }
 
+    if ( is_hardware_domain(d) )
+    {
+        unsigned int i;
+
+        /*
+         * For the hardware domain only remove any hypervisor mappings of the
+         * MSIX or PBA related areas, as dom0 is capable of moving the position
+         * of the BARs in the host address space.
+         *
+         * We rely on being called with the vPCI lock held in order to not race
+         * with get_table().
+         */
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->msix->table); i++ )
+            if ( pdev->vpci->msix->table[i] )
+            {
+                iounmap(pdev->vpci->msix->table[i]);
+                pdev->vpci->msix->table[i] = NULL;
+            }
+    }
+
     return 0;
 }
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 6d48d496bb..652807a4a4 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -54,9 +54,12 @@ void vpci_remove_device(struct pci_dev *pdev)
     spin_unlock(&pdev->vpci->lock);
     if ( pdev->vpci->msix )
     {
+        unsigned int i;
+
         list_del(&pdev->vpci->msix->next);
-        if ( pdev->vpci->msix->pba )
-            iounmap(pdev->vpci->msix->pba);
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->msix->table); i++ )
+            if ( pdev->vpci->msix->table[i] )
+                iounmap(pdev->vpci->msix->table[i]);
     }
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d8acfeba8a..0b8a2a3c74 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -133,8 +133,12 @@ struct vpci {
         bool enabled         : 1;
         /* Masked? */
         bool masked          : 1;
-        /* PBA map */
-        void __iomem *pba;
+        /* Partial table map. */
+#define VPCI_MSIX_TBL_HEAD 0
+#define VPCI_MSIX_TBL_TAIL 1
+#define VPCI_MSIX_PBA_HEAD 2
+#define VPCI_MSIX_PBA_TAIL 3
+        void __iomem *table[4];
         /* Entries. */
         struct vpci_msix_entry {
             uint64_t addr;
-- 
2.40.0

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Jan Beulich 2 years, 10 months ago

On 22.03.2023 15:30, Roger Pau Monne wrote:
> Changes since v2:
>  - Slightly adjust VMSIX_ADDR_SAME_PAGE().
>  - Use IS_ALIGNED and unlikely for the non-aligned access checking.
>  - Move the check for the page mapped before the aligned one.
>  - Remove cast of data to uint8_t and instead use a mask in order to
>    avoid undefined behaviour when shifting.
>  - Remove Xen maps of the MSIX related regions when memory decoding
>    for the device is enabled by dom0, in order to purge stale maps.

I'm glad you thought of this. The new code has issues, though:

> @@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
>      return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
>  }
>  
> -static void __iomem *get_pba(struct vpci *vpci)
> +static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
>  {
>      struct vpci_msix *msix = vpci->msix;
>      /*
> -     * PBA will only be unmapped when the device is deassigned, so access it
> -     * without holding the vpci lock.
> +     * Regions will only be unmapped when the device is deassigned, so access
> +     * them without holding the vpci lock.

The first part of the sentence is now stale, and the second part is in
conflict ...

> @@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
>          }
>      }
>  
> +    if ( is_hardware_domain(d) )
> +    {
> +        unsigned int i;
> +
> +        /*
> +         * For the hardware domain only remove any hypervisor mappings of the
> +         * MSIX or PBA related areas, as dom0 is capable of moving the position
> +         * of the BARs in the host address space.
> +         *
> +         * We rely on being called with the vPCI lock held in order to not race
> +         * with get_table().

... with what you say (and utilize) here. Furthermore this comment also wants
clarifying that apply_map() -> modify_decoding() not (afaics) holding the lock
when calling here is not a problem, as no mapping can exist yet that may need
tearing down. (I first wondered whether you wouldn't want to assert that the
lock is being held. You actually could, but only after finding a non-NULL
table entry.)

Jan

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monné 2 years, 10 months ago

On Wed, Mar 22, 2023 at 04:14:54PM +0100, Jan Beulich wrote:
> On 22.03.2023 15:30, Roger Pau Monne wrote:
> > Changes since v2:
> >  - Slightly adjust VMSIX_ADDR_SAME_PAGE().
> >  - Use IS_ALIGNED and unlikely for the non-aligned access checking.
> >  - Move the check for the page mapped before the aligned one.
> >  - Remove cast of data to uint8_t and instead use a mask in order to
> >    avoid undefined behaviour when shifting.
> >  - Remove Xen maps of the MSIX related regions when memory decoding
> >    for the device is enabled by dom0, in order to purge stale maps.
> 
> I'm glad you thought of this. The new code has issues, though:
> 
> > @@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
> >      return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
> >  }
> >  
> > -static void __iomem *get_pba(struct vpci *vpci)
> > +static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
> >  {
> >      struct vpci_msix *msix = vpci->msix;
> >      /*
> > -     * PBA will only be unmapped when the device is deassigned, so access it
> > -     * without holding the vpci lock.
> > +     * Regions will only be unmapped when the device is deassigned, so access
> > +     * them without holding the vpci lock.
> 
> The first part of the sentence is now stale, and the second part is in
> conflict ...
> 
> > @@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
> >          }
> >      }
> >  
> > +    if ( is_hardware_domain(d) )
> > +    {
> > +        unsigned int i;
> > +
> > +        /*
> > +         * For the hardware domain only remove any hypervisor mappings of the
> > +         * MSIX or PBA related areas, as dom0 is capable of moving the position
> > +         * of the BARs in the host address space.
> > +         *
> > +         * We rely on being called with the vPCI lock held in order to not race
> > +         * with get_table().
> 
> ... with what you say (and utilize) here. Furthermore this comment also wants
> clarifying that apply_map() -> modify_decoding() not (afaics) holding the lock
> when calling here is not a problem, as no mapping can exist yet that may need
> tearing down. (I first wondered whether you wouldn't want to assert that the
> lock is being held. You actually could, but only after finding a non-NULL
> table entry.)

Oh, yes, sorry, I should update those comments.  vpci_make_msix_hole()
gets called before bars[].enabled gets set, so there should be no
users of the mappings at that time because we don't handle accesses
when the BAR is not mapped.

Not sure whether we should consider an access from when the BAR was
actually enabled by a different thread could still continue while on
another thread the BAR has been disabled and enabled again (and thus
the mapping removed).  It's a theoretical race, so I guess I will look
into making sure we cannot hit it.

Thanks, Roger.

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monné 2 years, 10 months ago

On Wed, Mar 22, 2023 at 06:05:55PM +0100, Roger Pau Monné wrote:
> On Wed, Mar 22, 2023 at 04:14:54PM +0100, Jan Beulich wrote:
> > On 22.03.2023 15:30, Roger Pau Monne wrote:
> > > Changes since v2:
> > >  - Slightly adjust VMSIX_ADDR_SAME_PAGE().
> > >  - Use IS_ALIGNED and unlikely for the non-aligned access checking.
> > >  - Move the check for the page mapped before the aligned one.
> > >  - Remove cast of data to uint8_t and instead use a mask in order to
> > >    avoid undefined behaviour when shifting.
> > >  - Remove Xen maps of the MSIX related regions when memory decoding
> > >    for the device is enabled by dom0, in order to purge stale maps.
> > 
> > I'm glad you thought of this. The new code has issues, though:
> > 
> > > @@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
> > >      return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
> > >  }
> > >  
> > > -static void __iomem *get_pba(struct vpci *vpci)
> > > +static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
> > >  {
> > >      struct vpci_msix *msix = vpci->msix;
> > >      /*
> > > -     * PBA will only be unmapped when the device is deassigned, so access it
> > > -     * without holding the vpci lock.
> > > +     * Regions will only be unmapped when the device is deassigned, so access
> > > +     * them without holding the vpci lock.
> > 
> > The first part of the sentence is now stale, and the second part is in
> > conflict ...
> > 
> > > @@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
> > >          }
> > >      }
> > >  
> > > +    if ( is_hardware_domain(d) )
> > > +    {
> > > +        unsigned int i;
> > > +
> > > +        /*
> > > +         * For the hardware domain only remove any hypervisor mappings of the
> > > +         * MSIX or PBA related areas, as dom0 is capable of moving the position
> > > +         * of the BARs in the host address space.
> > > +         *
> > > +         * We rely on being called with the vPCI lock held in order to not race
> > > +         * with get_table().
> > 
> > ... with what you say (and utilize) here. Furthermore this comment also wants
> > clarifying that apply_map() -> modify_decoding() not (afaics) holding the lock
> > when calling here is not a problem, as no mapping can exist yet that may need
> > tearing down. (I first wondered whether you wouldn't want to assert that the
> > lock is being held. You actually could, but only after finding a non-NULL
> > table entry.)
> 
> Oh, yes, sorry, I should update those comments.  vpci_make_msix_hole()
> gets called before bars[].enabled gets set, so there should be no
> users of the mappings at that time because we don't handle accesses
> when the BAR is not mapped.
> 
> Not sure whether we should consider an access from when the BAR was
> actually enabled by a different thread could still continue while on
> another thread the BAR has been disabled and enabled again (and thus
> the mapping removed).  It's a theoretical race, so I guess I will look
> into making sure we cannot hit it.

Hm, maybe it doesn't matter much because such kind of trace could only
be triggered by the hardware domain anyway, and it has plenty of other
ways to mess with Xen.

Roger.

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Jan Beulich 2 years, 10 months ago

On 22.03.2023 18:08, Roger Pau Monné wrote:
> On Wed, Mar 22, 2023 at 06:05:55PM +0100, Roger Pau Monné wrote:
>> On Wed, Mar 22, 2023 at 04:14:54PM +0100, Jan Beulich wrote:
>>> On 22.03.2023 15:30, Roger Pau Monne wrote:
>>>> Changes since v2:
>>>>  - Slightly adjust VMSIX_ADDR_SAME_PAGE().
>>>>  - Use IS_ALIGNED and unlikely for the non-aligned access checking.
>>>>  - Move the check for the page mapped before the aligned one.
>>>>  - Remove cast of data to uint8_t and instead use a mask in order to
>>>>    avoid undefined behaviour when shifting.
>>>>  - Remove Xen maps of the MSIX related regions when memory decoding
>>>>    for the device is enabled by dom0, in order to purge stale maps.
>>>
>>> I'm glad you thought of this. The new code has issues, though:
>>>
>>>> @@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
>>>>      return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
>>>>  }
>>>>  
>>>> -static void __iomem *get_pba(struct vpci *vpci)
>>>> +static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
>>>>  {
>>>>      struct vpci_msix *msix = vpci->msix;
>>>>      /*
>>>> -     * PBA will only be unmapped when the device is deassigned, so access it
>>>> -     * without holding the vpci lock.
>>>> +     * Regions will only be unmapped when the device is deassigned, so access
>>>> +     * them without holding the vpci lock.
>>>
>>> The first part of the sentence is now stale, and the second part is in
>>> conflict ...
>>>
>>>> @@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
>>>>          }
>>>>      }
>>>>  
>>>> +    if ( is_hardware_domain(d) )
>>>> +    {
>>>> +        unsigned int i;
>>>> +
>>>> +        /*
>>>> +         * For the hardware domain only remove any hypervisor mappings of the
>>>> +         * MSIX or PBA related areas, as dom0 is capable of moving the position
>>>> +         * of the BARs in the host address space.
>>>> +         *
>>>> +         * We rely on being called with the vPCI lock held in order to not race
>>>> +         * with get_table().
>>>
>>> ... with what you say (and utilize) here. Furthermore this comment also wants
>>> clarifying that apply_map() -> modify_decoding() not (afaics) holding the lock
>>> when calling here is not a problem, as no mapping can exist yet that may need
>>> tearing down. (I first wondered whether you wouldn't want to assert that the
>>> lock is being held. You actually could, but only after finding a non-NULL
>>> table entry.)
>>
>> Oh, yes, sorry, I should update those comments.  vpci_make_msix_hole()
>> gets called before bars[].enabled gets set, so there should be no
>> users of the mappings at that time because we don't handle accesses
>> when the BAR is not mapped.
>>
>> Not sure whether we should consider an access from when the BAR was
>> actually enabled by a different thread could still continue while on
>> another thread the BAR has been disabled and enabled again (and thus
>> the mapping removed).  It's a theoretical race, so I guess I will look
>> into making sure we cannot hit it.
> 
> Hm, maybe it doesn't matter much because such kind of trace could only
> be triggered by the hardware domain anyway, and it has plenty of other
> ways to mess with Xen.

Preferably we should get things to use proper locking. If that turns out
too hard, properly justified such an exception for Dom0 might be
acceptable.

Jan

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monné 2 years, 10 months ago

On Thu, Mar 23, 2023 at 09:02:31AM +0100, Jan Beulich wrote:
> On 22.03.2023 18:08, Roger Pau Monné wrote:
> > On Wed, Mar 22, 2023 at 06:05:55PM +0100, Roger Pau Monné wrote:
> >> On Wed, Mar 22, 2023 at 04:14:54PM +0100, Jan Beulich wrote:
> >>> On 22.03.2023 15:30, Roger Pau Monne wrote:
> >>>> Changes since v2:
> >>>>  - Slightly adjust VMSIX_ADDR_SAME_PAGE().
> >>>>  - Use IS_ALIGNED and unlikely for the non-aligned access checking.
> >>>>  - Move the check for the page mapped before the aligned one.
> >>>>  - Remove cast of data to uint8_t and instead use a mask in order to
> >>>>    avoid undefined behaviour when shifting.
> >>>>  - Remove Xen maps of the MSIX related regions when memory decoding
> >>>>    for the device is enabled by dom0, in order to purge stale maps.
> >>>
> >>> I'm glad you thought of this. The new code has issues, though:
> >>>
> >>>> @@ -182,93 +187,201 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
> >>>>      return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
> >>>>  }
> >>>>  
> >>>> -static void __iomem *get_pba(struct vpci *vpci)
> >>>> +static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
> >>>>  {
> >>>>      struct vpci_msix *msix = vpci->msix;
> >>>>      /*
> >>>> -     * PBA will only be unmapped when the device is deassigned, so access it
> >>>> -     * without holding the vpci lock.
> >>>> +     * Regions will only be unmapped when the device is deassigned, so access
> >>>> +     * them without holding the vpci lock.
> >>>
> >>> The first part of the sentence is now stale, and the second part is in
> >>> conflict ...
> >>>
> >>>> @@ -482,6 +641,26 @@ int vpci_make_msix_hole(const struct pci_dev *pdev)
> >>>>          }
> >>>>      }
> >>>>  
> >>>> +    if ( is_hardware_domain(d) )
> >>>> +    {
> >>>> +        unsigned int i;
> >>>> +
> >>>> +        /*
> >>>> +         * For the hardware domain only remove any hypervisor mappings of the
> >>>> +         * MSIX or PBA related areas, as dom0 is capable of moving the position
> >>>> +         * of the BARs in the host address space.
> >>>> +         *
> >>>> +         * We rely on being called with the vPCI lock held in order to not race
> >>>> +         * with get_table().
> >>>
> >>> ... with what you say (and utilize) here. Furthermore this comment also wants
> >>> clarifying that apply_map() -> modify_decoding() not (afaics) holding the lock
> >>> when calling here is not a problem, as no mapping can exist yet that may need
> >>> tearing down. (I first wondered whether you wouldn't want to assert that the
> >>> lock is being held. You actually could, but only after finding a non-NULL
> >>> table entry.)
> >>
> >> Oh, yes, sorry, I should update those comments.  vpci_make_msix_hole()
> >> gets called before bars[].enabled gets set, so there should be no
> >> users of the mappings at that time because we don't handle accesses
> >> when the BAR is not mapped.
> >>
> >> Not sure whether we should consider an access from when the BAR was
> >> actually enabled by a different thread could still continue while on
> >> another thread the BAR has been disabled and enabled again (and thus
> >> the mapping removed).  It's a theoretical race, so I guess I will look
> >> into making sure we cannot hit it.
> > 
> > Hm, maybe it doesn't matter much because such kind of trace could only
> > be triggered by the hardware domain anyway, and it has plenty of other
> > ways to mess with Xen.
> 
> Preferably we should get things to use proper locking. If that turns out
> too hard, properly justified such an exception for Dom0 might be
> acceptable.

Right, one idea I have would be to use map_pages_to_xen() in
vpci_make_msix_hole() to remap any existing virtual addresses to point
to the new physical addresses, so that there's no unmap.  I think this
would be fine because map_pages_to_xen() does an atomic write of the
PTE, but not sure if it's abusing the interface.  Such remap would
avoid taking the vpci lock for the whole duration of the access.

Thanks, Roger.

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Jan Beulich 2 years, 10 months ago

On 23.03.2023 11:06, Roger Pau Monné wrote:
> On Thu, Mar 23, 2023 at 09:02:31AM +0100, Jan Beulich wrote:
>> On 22.03.2023 18:08, Roger Pau Monné wrote:
>>> On Wed, Mar 22, 2023 at 06:05:55PM +0100, Roger Pau Monné wrote:
>>>> Not sure whether we should consider an access from when the BAR was
>>>> actually enabled by a different thread could still continue while on
>>>> another thread the BAR has been disabled and enabled again (and thus
>>>> the mapping removed).  It's a theoretical race, so I guess I will look
>>>> into making sure we cannot hit it.
>>>
>>> Hm, maybe it doesn't matter much because such kind of trace could only
>>> be triggered by the hardware domain anyway, and it has plenty of other
>>> ways to mess with Xen.
>>
>> Preferably we should get things to use proper locking. If that turns out
>> too hard, properly justified such an exception for Dom0 might be
>> acceptable.
> 
> Right, one idea I have would be to use map_pages_to_xen() in
> vpci_make_msix_hole() to remap any existing virtual addresses to point
> to the new physical addresses, so that there's no unmap.  I think this
> would be fine because map_pages_to_xen() does an atomic write of the
> PTE, but not sure if it's abusing the interface.  Such remap would
> avoid taking the vpci lock for the whole duration of the access.

Hmm, no, I'm afraid that won't suffice for another reason: I think these
mappings need properly removing, and already at the time when memory
decoding is turned off (hence a potential new address isn't known yet).
Otherwise we leave too much of a risk that these mappings may actually
be used when the address range has already been re-purposed (even if I
think with how the code is written right now there wouldn't be any use
in reality, but relying on that going forward seems overly fragile to
me).

I can be convinced otherwise, so let me also consider that case: One
prereq would imo be that bar->enabled be cleared before memory decoding
is turned off and set only after it was turned back on. (Unless again
we want to be forgiving with Dom0.) (Looking at modify_decoding() I
also wonder why we invoke pci_check_bar() even when we're about to clear
{,rom_}enabled.)

As to (ab)using map_pages_to_xen() - I wouldn't be very happy about
playing with ioremap()-ed regions behind vmap.c's back, but I agree it
looks like technically it would work. In case of future issues we could
introduce something like vremap() to fiddle with an already mapped
region (and then perhaps also only changing addresses, but not
attributes).

Another option to avoid holding the lock across the entire access might
be for get_table() to actually acquire a ref on the mapping, with new
mappings starting out with a single ref, for vpci_make_msix_hole() to
drop (in place of explicitly unmapping). But that would require further
precautions to prevent stale mappings to have new refs obtained on
them, so might easily be getting too complicated for the purpose. As
would (likely) any other attempt at serializing this properly without
holding the lock. Which gets to ask: Would it really that bad to hold
the lock across the entire access? Or could we introduce a 2nd (per-
device) lock just for this purpose, which might then be an r/w one?

Jan

Re: [PATCH v3] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monné 2 years, 10 months ago

On Thu, Mar 23, 2023 at 11:58:26AM +0100, Jan Beulich wrote:
> On 23.03.2023 11:06, Roger Pau Monné wrote:
> > On Thu, Mar 23, 2023 at 09:02:31AM +0100, Jan Beulich wrote:
> >> On 22.03.2023 18:08, Roger Pau Monné wrote:
> >>> On Wed, Mar 22, 2023 at 06:05:55PM +0100, Roger Pau Monné wrote:
> >>>> Not sure whether we should consider an access from when the BAR was
> >>>> actually enabled by a different thread could still continue while on
> >>>> another thread the BAR has been disabled and enabled again (and thus
> >>>> the mapping removed).  It's a theoretical race, so I guess I will look
> >>>> into making sure we cannot hit it.
> >>>
> >>> Hm, maybe it doesn't matter much because such kind of trace could only
> >>> be triggered by the hardware domain anyway, and it has plenty of other
> >>> ways to mess with Xen.
> >>
> >> Preferably we should get things to use proper locking. If that turns out
> >> too hard, properly justified such an exception for Dom0 might be
> >> acceptable.
> > 
> > Right, one idea I have would be to use map_pages_to_xen() in
> > vpci_make_msix_hole() to remap any existing virtual addresses to point
> > to the new physical addresses, so that there's no unmap.  I think this
> > would be fine because map_pages_to_xen() does an atomic write of the
> > PTE, but not sure if it's abusing the interface.  Such remap would
> > avoid taking the vpci lock for the whole duration of the access.
> 
> Hmm, no, I'm afraid that won't suffice for another reason: I think these
> mappings need properly removing, and already at the time when memory
> decoding is turned off (hence a potential new address isn't known yet).
> Otherwise we leave too much of a risk that these mappings may actually
> be used when the address range has already been re-purposed (even if I
> think with how the code is written right now there wouldn't be any use
> in reality, but relying on that going forward seems overly fragile to
> me).

Keep in mind it's only dom0 that requires the mappings to be updated.
Hence dom0 playing tricks to read from such addresses when memory
decoding has been disabled seems like an unnecessary hassle, it likely
has plenty of other ways to do so that don't involve relying on races
between memory decoding and BAR accesses.

For domUs the address of the BARs on the real device can never change
while being used by the domU, and hence the mappings don't need
removing on memory decoding toggle.

> I can be convinced otherwise, so let me also consider that case: One
> prereq would imo be that bar->enabled be cleared before memory decoding
> is turned off and set only after it was turned back on. (Unless again
> we want to be forgiving with Dom0.) (Looking at modify_decoding() I
> also wonder why we invoke pci_check_bar() even when we're about to clear
> {,rom_}enabled.)

Hm, I see what you mean: it doesn't make sense to keep the BAR mapped
into the guest physmap if we then turn off memory decoding (or ROM BAR
enabling).  We need to prevent ROM BAR and/or memory decoding from
being turned off if pci_check_bar() returns false for any BAR of the
device.  Let's keep this separate from the issue at hand.

> 
> As to (ab)using map_pages_to_xen() - I wouldn't be very happy about
> playing with ioremap()-ed regions behind vmap.c's back, but I agree it
> looks like technically it would work. In case of future issues we could
> introduce something like vremap() to fiddle with an already mapped
> region (and then perhaps also only changing addresses, but not
> attributes).

I have such a version, which obviously looks very similar to v3.  Let
me try to finalize the comments and see how it looks overall.

> Another option to avoid holding the lock across the entire access might
> be for get_table() to actually acquire a ref on the mapping, with new
> mappings starting out with a single ref, for vpci_make_msix_hole() to
> drop (in place of explicitly unmapping). But that would require further
> precautions to prevent stale mappings to have new refs obtained on
> them, so might easily be getting too complicated for the purpose. As
> would (likely) any other attempt at serializing this properly without
> holding the lock. Which gets to ask: Would it really that bad to hold
> the lock across the entire access? Or could we introduce a 2nd (per-
> device) lock just for this purpose, which might then be an r/w one?

Adding a refcount or another lock seems like too much for my taste, I
would start by just using the vpci->lock for the whole access.

Thanks, Roger.