[PATCH v2] vfio/pci: Static Resizable BAR capability

Alex Williamson posted 1 patch 12 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230505232308.2869912-1-alex.williamson@redhat.com
Maintainers: Alex Williamson <alex.williamson@redhat.com>, "Cédric Le Goater" <clg@redhat.com>
hw/vfio/pci.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)
[PATCH v2] vfio/pci: Static Resizable BAR capability
Posted by Alex Williamson 12 months ago
The PCI Resizable BAR (ReBAR) capability is currently hidden from the
VM because the protocol for interacting with the capability does not
support a mechanism for the device to reject an advertised supported
BAR size.  However, when assigned to a VM, the act of resizing the
BAR requires adjustment of host resources for the device, which
absolutely can fail.  Linux does not currently allow us to reserve
resources for the device independent of the current usage.

The only writable field within the ReBAR capability is the BAR Size
register.  The PCIe spec indicates that when written, the device
should immediately begin to operate with the provided BAR size.  The
spec however also notes that software must only write values
corresponding to supported sizes as indicated in the capability and
control registers.  Writing unsupported sizes produces undefined
results.  Therefore, if the hypervisor were to virtualize the
capability and control registers such that the current size is the
only indicated available size, then a write of anything other than
the current size falls into the category of undefined behavior,
where we can essentially expose the modified ReBAR capability as
read-only.

This may seem pointless, but users have reported that virtualizing
the capability in this way not only allows guest software to expose
related features as available (even if only cosmetic), but in some
scenarios can resolve guest driver issues.  Additionally, no
regressions in behavior have been reported for this change.

A caveat here is that the PCIe spec requires for compatibility that
devices report support for a size in the range of 1MB to 512GB,
therefore if the current BAR size falls outside that range we revert
to hiding the capability.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
v2:
 - Add spec reference
 - Use PCI_REBAR_CAP_SIZES to check sizes in range
 - Try to clarify capability bit generation
 - Rename s/bars/nbar/ to match #defines
 - More complete masking of NBAR value

 hw/vfio/pci.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cf27f28936cb..3ab849767a92 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2066,6 +2066,54 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
     return 0;
 }
 
+static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
+{
+    uint32_t ctrl;
+    int i, nbar;
+
+    ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL);
+    nbar = (ctrl & PCI_REBAR_CTRL_NBAR_MASK) >> PCI_REBAR_CTRL_NBAR_SHIFT;
+
+    for (i = 0; i < nbar; i++) {
+        uint32_t cap;
+        int size;
+
+        ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL + (i * 8));
+        size = (ctrl & PCI_REBAR_CTRL_BAR_SIZE) >> PCI_REBAR_CTRL_BAR_SHIFT;
+
+        /* The cap register reports sizes 1MB to 127TB, with 4 reserved bits */
+        cap = size <= 27 ? 1U << (size + 4) : 0;
+
+        /*
+         * The PCIe spec (v6.0.1, 7.8.6) requires HW to support at least one
+         * size in the range 1MB to 512GB.  We intend to mask all sizes except
+         * the one currently enabled in the size field, therefore if it's
+         * outside the range, hide the whole capability as this virtualization
+         * trick won't work.  If >512GB resizable BARs start to appear, we
+         * might need an opt-in or reservation scheme in the kernel.
+         */
+        if (!(cap & PCI_REBAR_CAP_SIZES)) {
+            return -EINVAL;
+        }
+
+        /* Hide all sizes reported in the ctrl reg per above requirement. */
+        ctrl &= (PCI_REBAR_CTRL_BAR_SIZE |
+                 PCI_REBAR_CTRL_NBAR_MASK |
+                 PCI_REBAR_CTRL_BAR_IDX);
+
+        /*
+         * The BAR size field is RW, however we've mangled the capability
+         * register such that we only report a single size, ie. the current
+         * BAR size.  A write of an unsupported value is undefined, therefore
+         * the register field is essentially RO.
+         */
+        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CAP + (i * 8), cap, ~0);
+        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CTRL + (i * 8), ctrl, ~0);
+    }
+
+    return 0;
+}
+
 static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -2139,9 +2187,13 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
         case 0: /* kernel masked capability */
         case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
         case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
-        case PCI_EXT_CAP_ID_REBAR: /* Can't expose read-only */
             trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
             break;
+        case PCI_EXT_CAP_ID_REBAR:
+            if (!vfio_setup_rebar_ecap(vdev, next)) {
+                pcie_add_capability(pdev, cap_id, cap_ver, next, size);
+            }
+            break;
         default:
             pcie_add_capability(pdev, cap_id, cap_ver, next, size);
         }
-- 
2.39.2
Re: [PATCH v2] vfio/pci: Static Resizable BAR capability
Posted by Cédric Le Goater 12 months ago
On 5/6/23 01:23, Alex Williamson wrote:
> The PCI Resizable BAR (ReBAR) capability is currently hidden from the
> VM because the protocol for interacting with the capability does not
> support a mechanism for the device to reject an advertised supported
> BAR size.  However, when assigned to a VM, the act of resizing the
> BAR requires adjustment of host resources for the device, which
> absolutely can fail.  Linux does not currently allow us to reserve
> resources for the device independent of the current usage.
> 
> The only writable field within the ReBAR capability is the BAR Size
> register.  The PCIe spec indicates that when written, the device
> should immediately begin to operate with the provided BAR size.  The
> spec however also notes that software must only write values
> corresponding to supported sizes as indicated in the capability and
> control registers.  Writing unsupported sizes produces undefined
> results.  Therefore, if the hypervisor were to virtualize the
> capability and control registers such that the current size is the
> only indicated available size, then a write of anything other than
> the current size falls into the category of undefined behavior,
> where we can essentially expose the modified ReBAR capability as
> read-only.
> 
> This may seem pointless, but users have reported that virtualizing
> the capability in this way not only allows guest software to expose
> related features as available (even if only cosmetic), but in some
> scenarios can resolve guest driver issues.  Additionally, no
> regressions in behavior have been reported for this change.
> 
> A caveat here is that the PCIe spec requires for compatibility that
> devices report support for a size in the range of 1MB to 512GB,
> therefore if the current BAR size falls outside that range we revert
> to hiding the capability.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Cédric Le Goater <clg@redhat.com>

> ---
> v2:
>   - Add spec reference
>   - Use PCI_REBAR_CAP_SIZES to check sizes in range
>   - Try to clarify capability bit generation
>   - Rename s/bars/nbar/ to match #defines
>   - More complete masking of NBAR value
> 
>   hw/vfio/pci.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index cf27f28936cb..3ab849767a92 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2066,6 +2066,54 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
>       return 0;
>   }
>   
> +static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
> +{
> +    uint32_t ctrl;
> +    int i, nbar;
> +
> +    ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL);
> +    nbar = (ctrl & PCI_REBAR_CTRL_NBAR_MASK) >> PCI_REBAR_CTRL_NBAR_SHIFT;
> +
> +    for (i = 0; i < nbar; i++) {
> +        uint32_t cap;
> +        int size;
> +
> +        ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL + (i * 8));
> +        size = (ctrl & PCI_REBAR_CTRL_BAR_SIZE) >> PCI_REBAR_CTRL_BAR_SHIFT;
> +
> +        /* The cap register reports sizes 1MB to 127TB, with 4 reserved bits */

s/127/128/

C.

> +        cap = size <= 27 ? 1U << (size + 4) : 0;
> +
> +        /*
> +         * The PCIe spec (v6.0.1, 7.8.6) requires HW to support at least one
> +         * size in the range 1MB to 512GB.  We intend to mask all sizes except
> +         * the one currently enabled in the size field, therefore if it's
> +         * outside the range, hide the whole capability as this virtualization
> +         * trick won't work.  If >512GB resizable BARs start to appear, we
> +         * might need an opt-in or reservation scheme in the kernel.
> +         */
> +        if (!(cap & PCI_REBAR_CAP_SIZES)) {
> +            return -EINVAL;
> +        }
> +
> +        /* Hide all sizes reported in the ctrl reg per above requirement. */
> +        ctrl &= (PCI_REBAR_CTRL_BAR_SIZE |
> +                 PCI_REBAR_CTRL_NBAR_MASK |
> +                 PCI_REBAR_CTRL_BAR_IDX);
> +
> +        /*
> +         * The BAR size field is RW, however we've mangled the capability
> +         * register such that we only report a single size, ie. the current
> +         * BAR size.  A write of an unsupported value is undefined, therefore
> +         * the register field is essentially RO.
> +         */
> +        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CAP + (i * 8), cap, ~0);
> +        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CTRL + (i * 8), ctrl, ~0);
> +    }
> +
> +    return 0;
> +}
> +
>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>   {
>       PCIDevice *pdev = &vdev->pdev;
> @@ -2139,9 +2187,13 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>           case 0: /* kernel masked capability */
>           case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
>           case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> -        case PCI_EXT_CAP_ID_REBAR: /* Can't expose read-only */
>               trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
>               break;
> +        case PCI_EXT_CAP_ID_REBAR:
> +            if (!vfio_setup_rebar_ecap(vdev, next)) {
> +                pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> +            }
> +            break;
>           default:
>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>           }


Re: [PATCH v2] vfio/pci: Static Resizable BAR capability
Posted by Alex Williamson 12 months ago
On Tue, 9 May 2023 11:39:57 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> On 5/6/23 01:23, Alex Williamson wrote:
> > The PCI Resizable BAR (ReBAR) capability is currently hidden from the
> > VM because the protocol for interacting with the capability does not
> > support a mechanism for the device to reject an advertised supported
> > BAR size.  However, when assigned to a VM, the act of resizing the
> > BAR requires adjustment of host resources for the device, which
> > absolutely can fail.  Linux does not currently allow us to reserve
> > resources for the device independent of the current usage.
> > 
> > The only writable field within the ReBAR capability is the BAR Size
> > register.  The PCIe spec indicates that when written, the device
> > should immediately begin to operate with the provided BAR size.  The
> > spec however also notes that software must only write values
> > corresponding to supported sizes as indicated in the capability and
> > control registers.  Writing unsupported sizes produces undefined
> > results.  Therefore, if the hypervisor were to virtualize the
> > capability and control registers such that the current size is the
> > only indicated available size, then a write of anything other than
> > the current size falls into the category of undefined behavior,
> > where we can essentially expose the modified ReBAR capability as
> > read-only.
> > 
> > This may seem pointless, but users have reported that virtualizing
> > the capability in this way not only allows guest software to expose
> > related features as available (even if only cosmetic), but in some
> > scenarios can resolve guest driver issues.  Additionally, no
> > regressions in behavior have been reported for this change.
> > 
> > A caveat here is that the PCIe spec requires for compatibility that
> > devices report support for a size in the range of 1MB to 512GB,
> > therefore if the current BAR size falls outside that range we revert
> > to hiding the capability.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> 
> > ---
> > v2:
> >   - Add spec reference
> >   - Use PCI_REBAR_CAP_SIZES to check sizes in range
> >   - Try to clarify capability bit generation
> >   - Rename s/bars/nbar/ to match #defines
> >   - More complete masking of NBAR value
> > 
> >   hw/vfio/pci.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 53 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index cf27f28936cb..3ab849767a92 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2066,6 +2066,54 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
> >       return 0;
> >   }
> >   
> > +static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
> > +{
> > +    uint32_t ctrl;
> > +    int i, nbar;
> > +
> > +    ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL);
> > +    nbar = (ctrl & PCI_REBAR_CTRL_NBAR_MASK) >> PCI_REBAR_CTRL_NBAR_SHIFT;
> > +
> > +    for (i = 0; i < nbar; i++) {
> > +        uint32_t cap;
> > +        int size;
> > +
> > +        ctrl = pci_get_long(vdev->pdev.config + pos + PCI_REBAR_CTRL + (i * 8));
> > +        size = (ctrl & PCI_REBAR_CTRL_BAR_SIZE) >> PCI_REBAR_CTRL_BAR_SHIFT;
> > +
> > +        /* The cap register reports sizes 1MB to 127TB, with 4 reserved bits */  
> 
> s/127/128/

Yes, fixed.  Thanks!

Alex

> > +        cap = size <= 27 ? 1U << (size + 4) : 0;
> > +
> > +        /*
> > +         * The PCIe spec (v6.0.1, 7.8.6) requires HW to support at least one
> > +         * size in the range 1MB to 512GB.  We intend to mask all sizes except
> > +         * the one currently enabled in the size field, therefore if it's
> > +         * outside the range, hide the whole capability as this virtualization
> > +         * trick won't work.  If >512GB resizable BARs start to appear, we
> > +         * might need an opt-in or reservation scheme in the kernel.
> > +         */
> > +        if (!(cap & PCI_REBAR_CAP_SIZES)) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        /* Hide all sizes reported in the ctrl reg per above requirement. */
> > +        ctrl &= (PCI_REBAR_CTRL_BAR_SIZE |
> > +                 PCI_REBAR_CTRL_NBAR_MASK |
> > +                 PCI_REBAR_CTRL_BAR_IDX);
> > +
> > +        /*
> > +         * The BAR size field is RW, however we've mangled the capability
> > +         * register such that we only report a single size, ie. the current
> > +         * BAR size.  A write of an unsupported value is undefined, therefore
> > +         * the register field is essentially RO.
> > +         */
> > +        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CAP + (i * 8), cap, ~0);
> > +        vfio_add_emulated_long(vdev, pos + PCI_REBAR_CTRL + (i * 8), ctrl, ~0);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >   {
> >       PCIDevice *pdev = &vdev->pdev;
> > @@ -2139,9 +2187,13 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >           case 0: /* kernel masked capability */
> >           case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> >           case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
> > -        case PCI_EXT_CAP_ID_REBAR: /* Can't expose read-only */
> >               trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
> >               break;
> > +        case PCI_EXT_CAP_ID_REBAR:
> > +            if (!vfio_setup_rebar_ecap(vdev, next)) {
> > +                pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> > +            }
> > +            break;
> >           default:
> >               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> >           }  
>