[PATCH] vpci/msix: handle accesses adjacent to the MSI-X table

Posted by Roger Pau Monne 1 year, 1 month ago
The handling of MSI-X table accesses by Xen requires that any pages
that are part of the MSI-X table not be mapped into the domain physmap.
As a result, any device registers in the same pages as the start or
the end of the MSI-X table are currently not accessible: such accesses
are simply dropped.

Note the spec forbids such placement of registers, as the MSI-X and PBA
tables must be 4K isolated from any other registers:

"If a Base Address register that maps address space for the MSI-X
Table or MSI-X PBA also maps other usable address space that is not
associated with MSI-X structures, locations (e.g., for CSRs) used in
the other address space must not share any naturally aligned 4-KB
address range with one where either MSI-X structure resides."

Yet the 'Intel Wi-Fi 6 AX201' device on one of my boxes has registers
in the same page as the MSI-X table, and thus won't work on a PVH dom0
without this fix.

In order to cope with such devices, pass through to the underlying
hardware any accesses that fall on the same page(s) as the MSI-X table
but outside of the table itself.  Such forwarding also takes care of
the PBA accesses in case the PBA shares a page with the MSI-X table,
which allows removing the code doing this handling in msix_{read,write}.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/drivers/vpci/msix.c | 259 +++++++++++++++++++++++++---------------
 xen/drivers/vpci/vpci.c |   7 +-
 xen/include/xen/vpci.h  |   6 +-
 3 files changed, 175 insertions(+), 97 deletions(-)

diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index bea0cc7aed..1b59c7fc14 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -27,6 +27,13 @@
     ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
      (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
 
+#define VMSIX_ADDR_ADJACENT(addr, vpci, nr)                               \
+    ((PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr)) &&           \
+      (addr) < vmsix_table_addr(vpci, nr)) ||                             \
+     (PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr) +             \
+                                 vmsix_table_size(vpci, nr) - 1) &&       \
+      (addr) >= vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr)))
+
 static uint32_t cf_check control_read(
     const struct pci_dev *pdev, unsigned int reg, void *data)
 {
@@ -145,11 +152,9 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
-        unsigned int i;
 
-        for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
-            if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
-                 VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
+        if ( bars[msix->tables[VPCI_MSIX_TABLE] & PCI_MSIX_BIRMASK].enabled &&
+             VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
                 return msix;
     }
 
@@ -182,36 +187,38 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
     return &msix->entries[(addr - start) / PCI_MSIX_ENTRY_SIZE];
 }
 
-static void __iomem *get_pba(struct vpci *vpci)
+static void __iomem *get_table(struct vpci *vpci, unsigned int slot)
 {
     struct vpci_msix *msix = vpci->msix;
     /*
-     * PBA will only be unmapped when the device is deassigned, so access it
+     * MSIX will only be unmapped when the device is deassigned, so access it
      * without holding the vpci lock.
      */
-    void __iomem *pba = read_atomic(&msix->pba);
+    void __iomem *table = read_atomic(&msix->table[slot]);
 
-    if ( likely(pba) )
-        return pba;
+    if ( likely(table) )
+        return table;
 
-    pba = ioremap(vmsix_table_addr(vpci, VPCI_MSIX_PBA),
-                  vmsix_table_size(vpci, VPCI_MSIX_PBA));
-    if ( !pba )
-        return read_atomic(&msix->pba);
+    table = ioremap(round_pgdown(vmsix_table_addr(vpci, VPCI_MSIX_TABLE) +
+                                 (slot == VPCI_MSIX_TBL_HEAD ?
+                                  0 : vmsix_table_size(vpci, VPCI_MSIX_TABLE))),
+                    PAGE_SIZE);
+    if ( !table )
+        return read_atomic(&msix->table[slot]);
 
     spin_lock(&vpci->lock);
-    if ( !msix->pba )
+    if ( !msix->table[slot] )
     {
-        write_atomic(&msix->pba, pba);
+        write_atomic(&msix->table[slot], table);
         spin_unlock(&vpci->lock);
     }
     else
     {
         spin_unlock(&vpci->lock);
-        iounmap(pba);
+        iounmap(table);
     }
 
-    return read_atomic(&msix->pba);
+    return read_atomic(&msix->table[slot]);
 }
 
 static int cf_check msix_read(
@@ -230,45 +237,6 @@ static int cf_check msix_read(
     if ( !access_allowed(msix->pdev, addr, len) )
         return X86EMUL_OKAY;
 
-    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
-    {
-        struct vpci *vpci = msix->pdev->vpci;
-        unsigned int idx = addr - vmsix_table_addr(vpci, VPCI_MSIX_PBA);
-        const void __iomem *pba = get_pba(vpci);
-
-        /*
-         * Access to PBA.
-         *
-         * TODO: note that this relies on having the PBA identity mapped to the
-         * guest address space. If this changes the address will need to be
-         * translated.
-         */
-        if ( !pba )
-        {
-            gprintk(XENLOG_WARNING,
-                    "%pp: unable to map MSI-X PBA, report all pending\n",
-                    &msix->pdev->sbdf);
-            return X86EMUL_OKAY;
-        }
-
-        switch ( len )
-        {
-        case 4:
-            *data = readl(pba + idx);
-            break;
-
-        case 8:
-            *data = readq(pba + idx);
-            break;
-
-        default:
-            ASSERT_UNREACHABLE();
-            break;
-        }
-
-        return X86EMUL_OKAY;
-    }
-
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
@@ -317,43 +285,6 @@ static int cf_check msix_write(
     if ( !access_allowed(msix->pdev, addr, len) )
         return X86EMUL_OKAY;
 
-    if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
-    {
-        struct vpci *vpci = msix->pdev->vpci;
-        unsigned int idx = addr - vmsix_table_addr(vpci, VPCI_MSIX_PBA);
-        const void __iomem *pba = get_pba(vpci);
-
-        if ( !is_hardware_domain(d) )
-            /* Ignore writes to PBA for DomUs, it's behavior is undefined. */
-            return X86EMUL_OKAY;
-
-        if ( !pba )
-        {
-            /* Unable to map the PBA, ignore write. */
-            gprintk(XENLOG_WARNING,
-                    "%pp: unable to map MSI-X PBA, write ignored\n",
-                    &msix->pdev->sbdf);
-            return X86EMUL_OKAY;
-        }
-
-        switch ( len )
-        {
-        case 4:
-            writel(data, pba + idx);
-            break;
-
-        case 8:
-            writeq(data, pba + idx);
-            break;
-
-        default:
-            ASSERT_UNREACHABLE();
-            break;
-        }
-
-        return X86EMUL_OKAY;
-    }
-
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
@@ -438,6 +369,145 @@ static const struct hvm_mmio_ops vpci_msix_table_ops = {
     .write = msix_write,
 };
 
+const static struct vpci_msix *adjacent_find(const struct domain *d,
+                                             unsigned long addr)
+{
+    const struct vpci_msix *msix;
+
+    list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
+        /*
+         * So far vPCI only traps accesses to the MSIX table, but not the PBA
+         * explicitly, and hence we only need to check for the hole created by
+         * the MSIX table.
+         *
+         * If the PBA table is also trapped, the check here should be expanded
+         * to take it into account.
+         */
+        if ( VMSIX_ADDR_ADJACENT(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
+            return msix;
+
+    return NULL;
+}
+
+static int cf_check adjacent_accept(struct vcpu *v, unsigned long addr)
+{
+    return !!adjacent_find(v->domain, addr);
+}
+
+static int cf_check adjacent_read(
+    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
+{
+    const struct domain *d = v->domain;
+    const struct vpci_msix *msix = adjacent_find(d, addr);
+    const void __iomem *mem;
+    paddr_t msix_tbl;
+    struct vpci *vpci;
+
+    *data = ~0ul;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    vpci = msix->pdev->vpci;
+    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
+
+    if ( addr + len > round_pgup(msix_tbl +
+                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
+        return X86EMUL_OKAY;
+
+    mem = get_table(vpci,
+                    PFN_DOWN(addr) == PFN_DOWN(msix_tbl) ? VPCI_MSIX_TBL_HEAD
+                                                         : VPCI_MSIX_TBL_TAIL);
+    if ( !mem )
+        return X86EMUL_OKAY;
+
+    switch ( len )
+    {
+    case 1:
+        *data = readb(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 2:
+        *data = readw(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 4:
+        *data = readl(mem + PAGE_OFFSET(addr));
+        break;
+
+    case 8:
+        *data = readq(mem + PAGE_OFFSET(addr));
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static int cf_check adjacent_write(
+    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
+{
+    const struct domain *d = v->domain;
+    const struct vpci_msix *msix = adjacent_find(d, addr);
+    void __iomem *mem;
+    paddr_t msix_tbl;
+    struct vpci *vpci;
+
+    if ( !msix )
+        return X86EMUL_RETRY;
+
+    vpci = msix->pdev->vpci;
+    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
+
+    if ( addr + len > round_pgup(msix_tbl +
+                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
+        return X86EMUL_OKAY;
+
+    if ( (VMSIX_ADDR_IN_RANGE(addr, vpci, VPCI_MSIX_PBA) ||
+          VMSIX_ADDR_IN_RANGE(addr + len - 1, vpci, VPCI_MSIX_PBA)) &&
+         !is_hardware_domain(d) )
+        /* Ignore writes to PBA for DomUs, it's undefined behavior. */
+        return X86EMUL_OKAY;
+
+    mem = get_table(vpci,
+                    PFN_DOWN(addr) == PFN_DOWN(msix_tbl) ? VPCI_MSIX_TBL_HEAD
+                                                         : VPCI_MSIX_TBL_TAIL);
+    if ( !mem )
+        return X86EMUL_OKAY;
+
+    switch ( len )
+    {
+    case 1:
+        writeb(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 2:
+        writew(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 4:
+        writel(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    case 8:
+        writeq(data, mem + PAGE_OFFSET(addr));
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_msix_adj_ops = {
+    .check = adjacent_accept,
+    .read = adjacent_read,
+    .write = adjacent_write,
+};
+
 int vpci_make_msix_hole(const struct pci_dev *pdev)
 {
     struct domain *d = pdev->domain;
@@ -530,7 +600,10 @@ static int cf_check init_msix(struct pci_dev *pdev)
     }
 
     if ( list_empty(&d->arch.hvm.msix_tables) )
+    {
         register_mmio_handler(d, &vpci_msix_table_ops);
+        register_mmio_handler(d, &vpci_msix_adj_ops);
+    }
 
     pdev->vpci->msix = msix;
     list_add(&msix->next, &d->arch.hvm.msix_tables);
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 6d48d496bb..652807a4a4 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -54,9 +54,12 @@ void vpci_remove_device(struct pci_dev *pdev)
     spin_unlock(&pdev->vpci->lock);
     if ( pdev->vpci->msix )
     {
+        unsigned int i;
+
         list_del(&pdev->vpci->msix->next);
-        if ( pdev->vpci->msix->pba )
-            iounmap(pdev->vpci->msix->pba);
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->msix->table); i++ )
+            if ( pdev->vpci->msix->table[i] )
+                iounmap(pdev->vpci->msix->table[i]);
     }
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d8acfeba8a..b1ea312778 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -133,8 +133,10 @@ struct vpci {
         bool enabled         : 1;
         /* Masked? */
         bool masked          : 1;
-        /* PBA map */
-        void __iomem *pba;
+        /* Partial table map. */
+#define VPCI_MSIX_TBL_HEAD 0
+#define VPCI_MSIX_TBL_TAIL 1
+        void __iomem *table[2];
         /* Entries. */
         struct vpci_msix_entry {
             uint64_t addr;
-- 
2.39.0


Re: [PATCH] vpci/msix: handle accesses adjacent to the MSI-X table
Posted by Jan Beulich 1 year, 1 month ago
On 14.03.2023 11:13, Roger Pau Monne wrote:
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -27,6 +27,13 @@
>      ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
>       (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
>  
> +#define VMSIX_ADDR_ADJACENT(addr, vpci, nr)                               \
> +    ((PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr)) &&           \
> +      (addr) < vmsix_table_addr(vpci, nr)) ||                             \
> +     (PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr) +             \
> +                                 vmsix_table_size(vpci, nr) - 1) &&       \
> +      (addr) >= vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr)))

While I realize this may impact performance a little, did you consider
using !VMSIX_ADDR_IN_RANGE() instead of open-coding a variant of it? Then
again there's only a single use of the macro, and that's in code where
VMSIX_ADDR_IN_RANGE() was already checked (by the earlier invocation
of msix_find()), so the re-checking of the MSI-X table bounds isn't
strictly necessary anyway.
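
For illustration only - a sketch of that suggestion, not code from the
patch - the macro might then become:

#define VMSIX_ADDR_ADJACENT(addr, vpci, nr)                               \
    ((PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr)) ||           \
      PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr) +             \
                                 vmsix_table_size(vpci, nr) - 1)) &&      \
     !VMSIX_ADDR_IN_RANGE(addr, vpci, nr))

i.e. "in the table's head or tail page, but not within the table
itself", which is logically equivalent to the open-coded variant.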

> @@ -438,6 +369,145 @@ static const struct hvm_mmio_ops vpci_msix_table_ops = {
>      .write = msix_write,
>  };
>  
> +const static struct vpci_msix *adjacent_find(const struct domain *d,
> +                                             unsigned long addr)
> +{
> +    const struct vpci_msix *msix;
> +
> +    list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
> +        /*
> +         * So far vPCI only traps accesses to the MSIX table, but not the PBA
> +         * explicitly, and hence we only need to check for the hole created by
> +         * the MSIX table.
> +         *
> +         * If the PBA table is also trapped, the check here should be expanded
> +         * to take it into account.
> +         */
> +        if ( VMSIX_ADDR_ADJACENT(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
> +            return msix;

Is the comment really correct when considering that you don't change
vpci_make_msix_hole()? (I was actually puzzled by struct vpci_msix'es
table[] field remaining a 2-element array, despite the PBA now being
dealt with differently. But I realize you need to keep that for the
VMSIX_ADDR_IN_RANGE() in adjacent_write().)

> +static int cf_check adjacent_read(
> +    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
> +{
> +    const struct domain *d = v->domain;
> +    const struct vpci_msix *msix = adjacent_find(d, addr);
> +    const void __iomem *mem;
> +    paddr_t msix_tbl;
> +    struct vpci *vpci;
> +
> +    *data = ~0ul;
> +
> +    if ( !msix )
> +        return X86EMUL_RETRY;
> +
> +    vpci = msix->pdev->vpci;
> +    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
> +
> +    if ( addr + len > round_pgup(msix_tbl +
> +                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
> +        return X86EMUL_OKAY;
> +
> +    mem = get_table(vpci,
> +                    PFN_DOWN(addr) == PFN_DOWN(msix_tbl) ? VPCI_MSIX_TBL_HEAD
> +                                                         : VPCI_MSIX_TBL_TAIL);
> +    if ( !mem )
> +        return X86EMUL_OKAY;

The respective PBA logic had a gprintk() on this path.

> +    switch ( len )
> +    {
> +    case 1:
> +        *data = readb(mem + PAGE_OFFSET(addr));
> +        break;
> +
> +    case 2:
> +        *data = readw(mem + PAGE_OFFSET(addr));
> +        break;
> +
> +    case 4:
> +        *data = readl(mem + PAGE_OFFSET(addr));
> +        break;
> +
> +    case 8:
> +        *data = readq(mem + PAGE_OFFSET(addr));
> +        break;

So far we have allowed only aligned 4- and 8-byte accesses to the PBA.
Shouldn't we continue to do so?
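
For illustration, a minimal sketch of keeping that restriction for the
PBA portion of the page - hypothetical code, placed ahead of the
switch() in adjacent_read():

    if ( (VMSIX_ADDR_IN_RANGE(addr, vpci, VPCI_MSIX_PBA) ||
          VMSIX_ADDR_IN_RANGE(addr + len - 1, vpci, VPCI_MSIX_PBA)) &&
         ((len != 4 && len != 8) || !IS_ALIGNED(addr, len)) )
        return X86EMUL_OKAY;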

I'm also concerned about misaligned accesses: While we can't keep the
guest from doing such on pages we don't intercept, depending on the kind
of anomalies such accesses may cause, the effects there may be contained to that
guest. When doing the accesses from the hypervisor, bad effects could
affect the entire system. (FTAOD I don't mean to constrain guests, but I
do think we need to consider splitting misaligned accesses.)

> +    default:
> +        ASSERT_UNREACHABLE();

Is this correct? In msix_{read,write}() these assertions are valid
because of the earlier access_allowed() checks, but here you have
nothing like that. Yes, the emulator currently would only pass sizes
that fit what is being handled, but relying on no "unusual" insns
appearing down the road feels risky. Then again
hvmemul_phys_mmio_access() splits accesses accordingly, so perhaps
all is fine here.

> +static int cf_check adjacent_write(
> +    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
> +{
> +    const struct domain *d = v->domain;
> +    const struct vpci_msix *msix = adjacent_find(d, addr);
> +    void __iomem *mem;
> +    paddr_t msix_tbl;
> +    struct vpci *vpci;
> +
> +    if ( !msix )
> +        return X86EMUL_RETRY;
> +
> +    vpci = msix->pdev->vpci;
> +    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
> +
> +    if ( addr + len > round_pgup(msix_tbl +
> +                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
> +        return X86EMUL_OKAY;
> +
> +    if ( (VMSIX_ADDR_IN_RANGE(addr, vpci, VPCI_MSIX_PBA) ||
> +          VMSIX_ADDR_IN_RANGE(addr + len - 1, vpci, VPCI_MSIX_PBA)) &&
> +         !is_hardware_domain(d) )
> +        /* Ignore writes to PBA for DomUs, it's undefined behavior. */
> +        return X86EMUL_OKAY;

Just as a remark: Checking only start and end is sufficient merely because
the PBA is a multiple of 8 bytes in size, and "len" currently cannot be
larger than 8. This feels somewhat fragile, but is - like above - presumably
okay.

> @@ -530,7 +600,10 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      }
>  
>      if ( list_empty(&d->arch.hvm.msix_tables) )
> +    {
>          register_mmio_handler(d, &vpci_msix_table_ops);
> +        register_mmio_handler(d, &vpci_msix_adj_ops);
> +    }

Did you consider re-using the same ops by widening what their accept()
covers, and by having read/write recognize inside vs outside accesses,
dealing with them differently (much like the PBA was dealt with before)?
Besides my gut feeling of this ending up being less code, there's also
the aspect of NR_IO_HANDLERS being the upper bound to how many handlers
may be registered.

Or else did you consider registering this further handler only when
there actually is a device where the MSI-X table has "slack" at the
front and/or end?
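
A rough sketch of such conditional registration in init_msix(), leaving
aside the need to track whether the handler was already registered for
the domain:

    paddr_t start = vmsix_table_addr(pdev->vpci, VPCI_MSIX_TABLE);
    paddr_t end = start + vmsix_table_size(pdev->vpci, VPCI_MSIX_TABLE);

    /* Only needed if the table doesn't fully cover its first/last page. */
    if ( PAGE_OFFSET(start) || PAGE_OFFSET(end) )
        register_mmio_handler(d, &vpci_msix_adj_ops);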

Jan
Re: [PATCH] vpci/msix: handle accesses adjacent to the MSI-X table
Posted by Roger Pau Monné 1 year, 1 month ago
On Tue, Mar 14, 2023 at 12:56:33PM +0100, Jan Beulich wrote:
> On 14.03.2023 11:13, Roger Pau Monne wrote:
> > --- a/xen/drivers/vpci/msix.c
> > +++ b/xen/drivers/vpci/msix.c
> > @@ -27,6 +27,13 @@
> >      ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
> >       (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
> >  
> > +#define VMSIX_ADDR_ADJACENT(addr, vpci, nr)                               \
> > +    ((PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr)) &&           \
> > +      (addr) < vmsix_table_addr(vpci, nr)) ||                             \
> > +     (PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr) +             \
> > +                                 vmsix_table_size(vpci, nr) - 1) &&       \
> > +      (addr) >= vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr)))
> 
> While I realize this may impact performance a little, did you consider
> using !VMSIX_ADDR_IN_RANGE() instead of open-coding a variant of it? Then
> again there's only a single use of the macro, and that's in code where
> VMSIX_ADDR_IN_RANGE() was already checked (by the earlier invocation
> of msix_find()), so the re-checking of the MSI-X table bounds isn't
> strictly necessary anyway.

I didn't want to rely on the order of execution of the MMIO handlers,
so I would rather make sure that the newly added one would work
correctly if it turns out to be checked for before the MSIX table handling
one.

I could indeed use !VMSIX_ADDR_IN_RANGE() if that is clearer.

> > @@ -438,6 +369,145 @@ static const struct hvm_mmio_ops vpci_msix_table_ops = {
> >      .write = msix_write,
> >  };
> >  
> > +const static struct vpci_msix *adjacent_find(const struct domain *d,
> > +                                             unsigned long addr)
> > +{
> > +    const struct vpci_msix *msix;
> > +
> > +    list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
> > +        /*
> > +         * So far vPCI only traps accesses to the MSIX table, but not the PBA
> > +         * explicitly, and hence we only need to check for the hole created by
> > +         * the MSIX table.
> > +         *
> > +         * If the PBA table is also trapped, the check here should be expanded
> > +         * to take it into account.
> > +         */
> > +        if ( VMSIX_ADDR_ADJACENT(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
> > +            return msix;
> 
> Is the comment really correct when considering that you don't change
> vpci_make_msix_hole()?

Urg, I was really confused then, as I didn't remember (and didn't
check) that we also punch a hole for the PBA.  That's not really
needed for dom0, as we allow both reads and writes in that case anyway.

> (I was actually puzzled by struct vpci_msix'es
> table[] field remaining a 2-element array, despite the PBA now being
> dealt with differently. But I realize you need to keep that for the
> VMSIX_ADDR_IN_RANGE() in adjacent_write().)

If we need to handle the PBA I would need to take it into account for
the array handling, since the PBA can be in a different set of page(s)
than the MSIX table.

> > +static int cf_check adjacent_read(
> > +    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
> > +{
> > +    const struct domain *d = v->domain;
> > +    const struct vpci_msix *msix = adjacent_find(d, addr);
> > +    const void __iomem *mem;
> > +    paddr_t msix_tbl;
> > +    struct vpci *vpci;
> > +
> > +    *data = ~0ul;
> > +
> > +    if ( !msix )
> > +        return X86EMUL_RETRY;
> > +
> > +    vpci = msix->pdev->vpci;
> > +    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
> > +
> > +    if ( addr + len > round_pgup(msix_tbl +
> > +                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
> > +        return X86EMUL_OKAY;
> > +
> > +    mem = get_table(vpci,
> > +                    PFN_DOWN(addr) == PFN_DOWN(msix_tbl) ? VPCI_MSIX_TBL_HEAD
> > +                                                         : VPCI_MSIX_TBL_TAIL);
> > +    if ( !mem )
> > +        return X86EMUL_OKAY;
> 
> The respective PBA logic had a gprintk() on this path.

Ack, will add one.

> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        *data = readb(mem + PAGE_OFFSET(addr));
> > +        break;
> > +
> > +    case 2:
> > +        *data = readw(mem + PAGE_OFFSET(addr));
> > +        break;
> > +
> > +    case 4:
> > +        *data = readl(mem + PAGE_OFFSET(addr));
> > +        break;
> > +
> > +    case 8:
> > +        *data = readq(mem + PAGE_OFFSET(addr));
> > +        break;
> 
> So far we have allowed only aligned 4- and 8-byte accesses to the PBA.
> Shouldn't we continue to do so?

I've read the spec about this and wasn't able to find a reference
about having to access the PBA using 4- and 8-byte accesses.  There's
one for the MSI-X table, but it's my understanding it only applies to
the table.

> I'm also concerned about misaligned accesses: While we can't keep the
> guest from doing such on pages we don't intercept, depending on the kind
> of anomalies such accesses may cause, the effects there may be contained to that
> guest. When doing the accesses from the hypervisor, bad effects could
> affect the entire system. (FTAOD I don't mean to constrain guests, but I
> do think we need to consider splitting misaligned accesses.)

I was also wondering about misaligned accesses.  Should we allow dom0
any kind of access, while limiting domUs to aligned only?

> 
> > +    default:
> > +        ASSERT_UNREACHABLE();
> 
> Is this correct? In msix_{read,write}() these assertions are valid
> because of the earlier access_allowed() checks, but here you have
> nothing like that. Yes, the emulator currently would only pass sizes
> that fit what is being handled, but relying on no "unusual" insns
> appearing down the road feels risky. Then again
> hvmemul_phys_mmio_access() splits accesses accordingly, so perhaps
> all is fine here.

It's only on debug builds, so on production ones the access would just
be silently dropped without any ill effect on the hypervisor state.

> > +static int cf_check adjacent_write(
> > +    struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
> > +{
> > +    const struct domain *d = v->domain;
> > +    const struct vpci_msix *msix = adjacent_find(d, addr);
> > +    void __iomem *mem;
> > +    paddr_t msix_tbl;
> > +    struct vpci *vpci;
> > +
> > +    if ( !msix )
> > +        return X86EMUL_RETRY;
> > +
> > +    vpci = msix->pdev->vpci;
> > +    msix_tbl = vmsix_table_addr(vpci, VPCI_MSIX_TABLE);
> > +
> > +    if ( addr + len > round_pgup(msix_tbl +
> > +                                 vmsix_table_size(vpci, VPCI_MSIX_TABLE)) )
> > +        return X86EMUL_OKAY;
> > +
> > +    if ( (VMSIX_ADDR_IN_RANGE(addr, vpci, VPCI_MSIX_PBA) ||
> > +          VMSIX_ADDR_IN_RANGE(addr + len - 1, vpci, VPCI_MSIX_PBA)) &&
> > +         !is_hardware_domain(d) )
> > +        /* Ignore writes to PBA for DomUs, it's undefined behavior. */
> > +        return X86EMUL_OKAY;
> 
> Just as a remark: Checking only start and end is sufficient merely because
> the PBA is a multiple of 8 bytes in size, and "len" currently cannot be
> larger than 8. This feels somewhat fragile, but is - like above - presumably
> okay.

Indeed, I was relying on len <= min PBA size.  I can add a comment to
clarify this.
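
Such a comment might read, roughly:

    /*
     * Checking the first and last byte of the access is sufficient: the
     * PBA size is a multiple of 8 bytes and len <= 8, so an access
     * cannot overlap the PBA without one of its two ends falling inside
     * of it.
     */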

> > @@ -530,7 +600,10 @@ static int cf_check init_msix(struct pci_dev *pdev)
> >      }
> >  
> >      if ( list_empty(&d->arch.hvm.msix_tables) )
> > +    {
> >          register_mmio_handler(d, &vpci_msix_table_ops);
> > +        register_mmio_handler(d, &vpci_msix_adj_ops);
> > +    }
> 
> Did you consider re-using the same ops by widening what their accept()
> covers, and by having read/write recognize inside vs outside accesses,
> dealing with them differently (much like the PBA was dealt with before)?

I had the feeling it would be clearer to have the MSIX handler only
deal with the MSIX table accesses (and so have its scope properly
limited in the accept hook), and deal with the fallout from having to
poke a 4K hole in the physmap using a different handler.

I will play a bit with both options and see which one I prefer.

> Besides my gut feeling of this ending up being less code, there's also
> the aspect of NR_IO_HANDLERS being the upper bound to how many handlers
> may be registered.
> 
> Or else did you consider registering this further handler only when
> there actually is a device where the MSI-X table has "slack" at the
> front and/or end?

Hm, no, I didn't consider this conditional registration.

Thanks, Roger.
Re: [PATCH] vpci/msix: handle accesses adjacent to the MSI-X table
Posted by Jan Beulich 1 year, 1 month ago
On 14.03.2023 14:54, Roger Pau Monné wrote:
> On Tue, Mar 14, 2023 at 12:56:33PM +0100, Jan Beulich wrote:
>> On 14.03.2023 11:13, Roger Pau Monne wrote:
>>> --- a/xen/drivers/vpci/msix.c
>>> +++ b/xen/drivers/vpci/msix.c
>>> @@ -27,6 +27,13 @@
>>>      ((addr) >= vmsix_table_addr(vpci, nr) &&                              \
>>>       (addr) < vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr))
>>>  
>>> +#define VMSIX_ADDR_ADJACENT(addr, vpci, nr)                               \
>>> +    ((PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr)) &&           \
>>> +      (addr) < vmsix_table_addr(vpci, nr)) ||                             \
>>> +     (PFN_DOWN(addr) == PFN_DOWN(vmsix_table_addr(vpci, nr) +             \
>>> +                                 vmsix_table_size(vpci, nr) - 1) &&       \
>>> +      (addr) >= vmsix_table_addr(vpci, nr) + vmsix_table_size(vpci, nr)))
>>
>> While I realize this may impact performance a little, did you consider
>> using !VMSIX_ADDR_IN_RANGE() instead of open-coding a variant of it? Then
>> again there's only a single use of the macro, and that's in code where
>> VMSIX_ADDR_IN_RANGE() was already checked (by the earlier invocation
>> of msix_find()), so the re-checking of the MSI-X table bounds isn't
>> strictly necessary anyway.
> 
> I didn't want to rely on the order of execution of the MMIO handlers,
> so I would rather make sure that the newly added one would work
> correctly if it turns out to be checked for before the MSIX table handling
> one.
> 
> I could indeed use !VMSIX_ADDR_IN_RANGE() if that is clearer.

I think it would make the source easier to follow, even if the generated
code is slightly worse.

>>> @@ -438,6 +369,145 @@ static const struct hvm_mmio_ops vpci_msix_table_ops = {
>>>      .write = msix_write,
>>>  };
>>>  
>>> +const static struct vpci_msix *adjacent_find(const struct domain *d,
>>> +                                             unsigned long addr)
>>> +{
>>> +    const struct vpci_msix *msix;
>>> +
>>> +    list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>>> +        /*
>>> +         * So far vPCI only traps accesses to the MSIX table, but not the PBA
>>> +         * explicitly, and hence we only need to check for the hole created by
>>> +         * the MSIX table.
>>> +         *
>>> +         * If the PBA table is also trapped, the check here should be expanded
>>> +         * to take it into account.
>>> +         */
>>> +        if ( VMSIX_ADDR_ADJACENT(addr, msix->pdev->vpci, VPCI_MSIX_TABLE) )
>>> +            return msix;
>>
>> Is the comment really correct when considering that you don't change
>> vpci_make_msix_hole()?
> 
> Urg, I was really confused then, as I didn't remember (and didn't
> check) that we also punch a hole for the PBA.  That's not really
> needed for dom0, as we allow both reads and writes in that case anyway.

Well, until now we restricted the kinds of writes. But ...

>>> +    switch ( len )
>>> +    {
>>> +    case 1:
>>> +        *data = readb(mem + PAGE_OFFSET(addr));
>>> +        break;
>>> +
>>> +    case 2:
>>> +        *data = readw(mem + PAGE_OFFSET(addr));
>>> +        break;
>>> +
>>> +    case 4:
>>> +        *data = readl(mem + PAGE_OFFSET(addr));
>>> +        break;
>>> +
>>> +    case 8:
>>> +        *data = readq(mem + PAGE_OFFSET(addr));
>>> +        break;
>>
>> So far we have allowed only aligned 4- and 8-byte accesses to the PBA.
>> Shouldn't we continue to do so?
> 
> I've read the spec about this and wasn't able to find a reference
> about having to access the PBA using 4- and 8-byte accesses.  There's
> one for the MSI-X table, but it's my understanding it only applies to
> the table.

... you make a good point here: Perhaps we were too restrictive. Such
a change in behavior wants calling out in the description, though.

>> I'm also concerned about misaligned accesses: While we can't keep the
>> guest from doing such on pages we don't intercept, depending on the kind
>> of anomalies such accesses may cause, the effects there may be contained to that
>> guest. When doing the accesses from the hypervisor, bad effects could
>> affect the entire system. (FTAOD I don't mean to constrain guests, but I
>> do think we need to consider splitting misaligned accesses.)
> 
> I was also wondering about misaligned accesses.  Should we allow dom0
> any kind of access, while limiting domUs to aligned only?

I guess the code would be simpler if we treated both equally. As said,
my goal is not to refuse misaligned accesses, but to break them up. To
keep things simple we might even use purely byte accesses (and then
perhaps simply REP MOVSB). Special casing Dom0 would only add code for
no real gain.
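
As an illustration of the splitting, a hypothetical helper (purely a
sketch, using byte accesses only):

    /* Replay a possibly misaligned MMIO read as individual bytes. */
    static uint64_t read_bytes(const void __iomem *mem, unsigned int offset,
                               unsigned int len)
    {
        uint64_t data = 0;
        unsigned int i;

        for ( i = 0; i < len; i++ )
            data |= (uint64_t)readb(mem + offset + i) << (i * 8);

        return data;
    }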

Jan

Re: [PATCH] vpci/msix: handle accesses adjacent to the MSI-X table
Posted by Roger Pau Monné 1 year, 1 month ago
On Tue, Mar 14, 2023 at 04:46:19PM +0100, Jan Beulich wrote:
> On 14.03.2023 14:54, Roger Pau Monné wrote:
> > On Tue, Mar 14, 2023 at 12:56:33PM +0100, Jan Beulich wrote:
> >> On 14.03.2023 11:13, Roger Pau Monne wrote:
> >> I'm also concerned about misaligned accesses: While we can't keep the
> >> guest from doing such on pages we don't intercept, depending on the kind
> >> of anomalies such accesses may cause, the effects there may be contained to that
> >> guest. When doing the accesses from the hypervisor, bad effects could
> >> affect the entire system. (FTAOD I don't mean to constrain guests, but I
> >> do think we need to consider splitting misaligned accesses.)
> > 
> > I was also wondering about misaligned accesses.  Should we allow dom0
> > any kind of access, while limiting domUs to aligned only?
> 
> I guess the code would be simpler if we treated both equally. As said,
> my goal is not to refuse misaligned accesses, but to break them up. To
> keep things simple we might even use purely byte accesses (and then
> perhaps simply REP MOVSB). Special casing Dom0 would only add code for
> no real gain.

Hm, I would be worried about then breaking the requirement of some
registers being accessed using a specific size, but again we are
dealing with misaligned accesses to a region that shouldn't contain
registers in the first place.

FWIW, the device I currently have that has registers in the same page
as the MSIX and the PBA tables is fine with limiting such accesses to
aligned only.

What is it that worries you about Xen relaying unaligned accesses
instead of just the domain itself doing it on any other BAR MMIO it
has directly mapped into the p2m?  Any error generated by the device
in such setup would likely have the same effect, regardless of whether
the access is in Xen or domain context.

Thanks, Roger.

Re: [PATCH] vpci/msix: handle accesses adjacent to the MSI-X table
Posted by Jan Beulich 1 year, 1 month ago
On 14.03.2023 17:34, Roger Pau Monné wrote:
> On Tue, Mar 14, 2023 at 04:46:19PM +0100, Jan Beulich wrote:
>> On 14.03.2023 14:54, Roger Pau Monné wrote:
>>> On Tue, Mar 14, 2023 at 12:56:33PM +0100, Jan Beulich wrote:
>>>> On 14.03.2023 11:13, Roger Pau Monne wrote:
>>>> I'm also concerned about misaligned accesses: While we can't keep the
>>>> guest from doing such on pages we don't intercept, depending on the kind
>>>> of anomalies such accesses may cause, the effects there may be contained to that
>>>> guest. When doing the accesses from the hypervisor, bad effects could
>>>> affect the entire system. (FTAOD I don't mean to constrain guests, but I
>>>> do think we need to consider splitting misaligned accesses.)
>>>
>>> I was also wondering about misaligned accesses.  Should we allow dom0
>>> any kind of access, while limiting domUs to aligned only?
>>
>> I guess the code would be simpler if we treated both equally. As said,
>> my goal is not to refuse misaligned accesses, but to break them up. To
>> keep things simple we might even use purely byte accesses (and then
>> perhaps simply REP MOVSB). Special casing Dom0 would only add code for
>> no real gain.
> 
> Hm, I would be worried about then breaking the requirement of some
> registers being accessed using a specific size, but again we are
> dealing with misaligned accesses to a region that shouldn't contain
> registers in the first place.

Well, you name it: We're talking about a specific-width access which
at the same time is misaligned. That doesn't sound very likely, and imo
wants caring about, at the earliest, when such a case actually arises.

> FWIW, the device I currently have that has registers in the same page
> as the MSIX and the PBA tables is fine with limiting such accesses to
> aligned only.
> 
> What is it that worries you about Xen relaying unaligned accesses
> instead of just the domain itself doing it on any other BAR MMIO it
> has directly mapped into the p2m?  Any error generated by the device
> in such setup would likely have the same effect, regardless of whether
> the access is in Xen or domain context.

It's more that doing this in Xen doesn't feel right. If I should provide
a contrived scenario, I'd point at #MC potentially being raised for
such an access. If we further assume we had better machine check
handling, then dealing with an #MC coming from a guest can be expected
to be more likely to keep the host alive than if the #MC was raised in
Xen (and e.g. - I didn't check here - possibly with some lock held).

Jan