[Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

Alex Williamson posted 1 patch 5 years ago
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test asan passed
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/155364082689.15803.7062874513041742278.stgit@gimli.home
Maintainers: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, "Michael S. Tsirkin" <mst@redhat.com>
[Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
distinguish by devfn & bus between devices in a conventional PCI
topology and therefore we cannot assign them separate AddressSpaces.
By taking this requester ID aliasing into account, QEMU better matches
the bare metal behavior and restrictions, and enables shared
AddressSpace configurations that are otherwise not possible with
guest IOMMU support.

For the latter case, given any example where an IOMMU group on the
host includes multiple devices:

  $ ls  /sys/kernel/iommu_groups/1/devices/
  0000:00:01.0  0000:01:00.0  0000:01:00.1

If we incorporate a vIOMMU into the VM configuration, we're restricted
to assigning only one of the endpoints to the guest because a
second endpoint will attempt to use a different AddressSpace.  VFIO
only supports IOMMU group level granularity at the container level,
preventing this second endpoint from being assigned:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-root-port,addr=1e.0,id=pcie.1 \
  -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
0000:01:00.1: group 1 used in multiple address spaces

However, when QEMU incorporates proper aliasing, we can make use of a
PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
provides the downstream devices with the same AddressSpace, ex:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

While the utility of this hack may be limited, this AddressSpace
aliasing is the correct behavior for QEMU to emulate bare metal.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 35451c1e9987..38467e676f1f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
+    uint8_t devfn = dev->devfn;
 
     while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
-        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
+        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
+
+        /*
+         * Determine which requester ID alias should be used for the device
+         * based on the PCI topology.  There are no requester IDs on convetional
+         * PCI buses, therefore we push the alias up to the parent on each non-
+         * express bus.  Which alias we use depends on whether this is a legacy
+         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
+         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
+         * because the resulting BDF depends on the secondary bridge register
+         * programming.  We also cannot lookup the PCIBus from the bus number
+         * at this point for the iommu_fn.  Also, requester_id_cache is the
+         * alias to the root bus, which is usually, but not necessarily always
+         * where we'll find our iommu_fn.
+         */
+        if (!pci_bus_is_express(iommu_bus)) {
+            PCIDevice *parent = iommu_bus->parent_dev;
+
+            if (pci_is_express(parent) &&
+                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
+                devfn = PCI_DEVFN(0, 0);
+                bus = iommu_bus;
+            } else {
+                devfn = parent->devfn;
+                bus = parent_bus;
+            }
+        }
+
+        iommu_bus = parent_bus;
     }
     if (iommu_bus && iommu_bus->iommu_fn) {
-        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
+        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
     }
     return &address_space_memory;
 }
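
For the pcie-pci-bridge example above, a hedged trace of how the new loop
resolves the alias (bus names refer to the -device ids above; on q35 the
root bus is where intel-iommu registers its iommu_fn):

    /* Iteration 1: iommu_bus = pci.1, a conventional bus whose parent_dev
     * is the pcie-pci-bridge (express, type PCI_EXP_TYPE_PCI_BRIDGE), so
     * the alias becomes devfn = PCI_DEVFN(0, 0) with bus = pci.1, and
     * iommu_bus advances to the root bus.
     * Iteration 2: the root bus has an iommu_fn, so the loop exits and
     * iommu_fn(pci.1, opaque, 00.0) is called.  Both downstream functions
     * (addr=1.0 and addr=1.1) resolve to the same {pci.1, 00.0} alias and
     * therefore share one AddressSpace. */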


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Auger Eric 5 years ago
Hi Alex,

On 3/26/19 11:55 PM, Alex Williamson wrote:
> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> distinguish by devfn & bus between devices in a conventional PCI
> topology and therefore we cannot assign them separate AddressSpaces.
> By taking this requester ID aliasing into account, QEMU better matches
> the bare metal behavior and restrictions, and enables shared
> AddressSpace configurations that are otherwise not possible with
> guest IOMMU support.
> 
> For the latter case, given any example where an IOMMU group on the
> host includes multiple devices:
> 
>   $ ls  /sys/kernel/iommu_groups/1/devices/
>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> 
> If we incorporate a vIOMMU into the VM configuration, we're restricted
> to assigning only one of the endpoints to the guest because a
> second endpoint will attempt to use a different AddressSpace.  VFIO
> only supports IOMMU group level granularity at the container level,
> preventing this second endpoint from being assigned:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> 
> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> 0000:01:00.1: group 1 used in multiple address spaces
> 
> However, when QEMU incorporates proper aliasing, we can make use of a
> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> provides the downstream devices with the same AddressSpace, ex:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> 
> While the utility of this hack may be limited, this AddressSpace
> aliasing is the correct behavior for QEMU to emulate bare metal.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 35451c1e9987..38467e676f1f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> +    uint8_t devfn = dev->devfn;
>  
>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> +
> +        /*
> +         * Determine which requester ID alias should be used for the device
> +         * based on the PCI topology.  There are no requester IDs on convetional
conventional
> +         * PCI buses, therefore we push the alias up to the parent on each non-
> +         * express bus.  Which alias we use depends on whether this is a legacy
> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> +         * because the resulting BDF depends on the secondary bridge register
Didn't you mean secondary bus number register?
> +         * programming.  We also cannot lookup the PCIBus from the bus number
> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> +         * alias to the root bus, which is usually, but not necessarily always
> +         * where we'll find our iommu_fn.
> +         */
> +        if (!pci_bus_is_express(iommu_bus)) {
> +            PCIDevice *parent = iommu_bus->parent_dev;
> +
> +            if (pci_is_express(parent) &&
> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                devfn = PCI_DEVFN(0, 0);
> +                bus = iommu_bus;
> +            } else {
> +                devfn = parent->devfn;
> +                bus = parent_bus;
> +            }
> +        }
> +
> +        iommu_bus = parent_bus;
>      }
>      if (iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
I think it would make sense to comment this iommu_fn() callback's role
somewhere.
>      }
>      return &address_space_memory;
>  }
> 
>

Thanks

Eric

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
On Wed, 27 Mar 2019 12:46:41 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 3/26/19 11:55 PM, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > to assigning only one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 35451c1e9987..38467e676f1f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >  {
> >      PCIBus *bus = pci_get_bus(dev);
> >      PCIBus *iommu_bus = bus;
> > +    uint8_t devfn = dev->devfn;
> >  
> >      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > +
> > +        /*
> > +         * Determine which requester ID alias should be used for the device
> > +         * based on the PCI topology.  There are no requester IDs on convetional  
> conventional

Oops, fixed

> > +         * PCI buses, therefore we push the alias up to the parent on each non-
> > +         * express bus.  Which alias we use depends on whether this is a legacy
> > +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> > +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> > +         * because the resulting BDF depends on the secondary bridge register  
> Didn't you mean secondary bus number register?

Yes, fixed

> > +         * programming.  We also cannot lookup the PCIBus from the bus number
> > +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> > +         * alias to the root bus, which is usually, but not necessarily always
> > +         * where we'll find our iommu_fn.
> > +         */
> > +        if (!pci_bus_is_express(iommu_bus)) {
> > +            PCIDevice *parent = iommu_bus->parent_dev;
> > +
> > +            if (pci_is_express(parent) &&
> > +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                devfn = PCI_DEVFN(0, 0);
> > +                bus = iommu_bus;
> > +            } else {
> > +                devfn = parent->devfn;
> > +                bus = parent_bus;
> > +            }
> > +        }
> > +
> > +        iommu_bus = parent_bus;
> >      }
> >      if (iommu_bus && iommu_bus->iommu_fn) {
> > -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> > +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);  
> I think it would make sense to comment this iommu_fn() callback's role
> somewhere.

While I agree, it seems a bit out of scope of this patch.  Or are you
suggesting that this patch fundamentally changes the role rather than
trying to make it work as intended?  Thanks,

Alex

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Auger Eric 5 years ago
Hi Alex,

On 3/27/19 7:02 PM, Alex Williamson wrote:
> On Wed, 27 Mar 2019 12:46:41 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 3/26/19 11:55 PM, Alex Williamson wrote:
>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>> distinguish by devfn & bus between devices in a conventional PCI
>>> topology and therefore we cannot assign them separate AddressSpaces.
>>> By taking this requester ID aliasing into account, QEMU better matches
>>> the bare metal behavior and restrictions, and enables shared
>>> AddressSpace configurations that are otherwise not possible with
>>> guest IOMMU support.
>>>
>>> For the latter case, given any example where an IOMMU group on the
>>> host includes multiple devices:
>>>
>>>   $ ls  /sys/kernel/iommu_groups/1/devices/
>>>   0000:00:01.0  0000:01:00.0  0000:01:00.1
>>>
>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>> to assigning only one of the endpoints to the guest because a
>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>> only supports IOMMU group level granularity at the container level,
>>> preventing this second endpoint from being assigned:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>
>>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>> 0000:01:00.1: group 1 used in multiple address spaces
>>>
>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>> provides the downstream devices with the same AddressSpace, ex:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>
>>> While the utility of this hack may be limited, this AddressSpace
>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
>>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index 35451c1e9987..38467e676f1f 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>  {
>>>      PCIBus *bus = pci_get_bus(dev);
>>>      PCIBus *iommu_bus = bus;
>>> +    uint8_t devfn = dev->devfn;
>>>  
>>>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>>> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
>>> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>>> +
>>> +        /*
>>> +         * Determine which requester ID alias should be used for the device
>>> +         * based on the PCI topology.  There are no requester IDs on convetional  
>> conventional
> 
> Oops, fixed
> 
>>> +         * PCI buses, therefore we push the alias up to the parent on each non-
>>> +         * express bus.  Which alias we use depends on whether this is a legacy
>>> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
>>> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
>>> +         * because the resulting BDF depends on the secondary bridge register  
>> Didn't you mean secondary bus number register?
> 
> Yes, fixed
> 
>>> +         * programming.  We also cannot lookup the PCIBus from the bus number
>>> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
>>> +         * alias to the root bus, which is usually, but not necessarily always
>>> +         * where we'll find our iommu_fn.
>>> +         */
>>> +        if (!pci_bus_is_express(iommu_bus)) {
>>> +            PCIDevice *parent = iommu_bus->parent_dev;
>>> +
>>> +            if (pci_is_express(parent) &&
>>> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
>>> +                devfn = PCI_DEVFN(0, 0);
>>> +                bus = iommu_bus;
>>> +            } else {
>>> +                devfn = parent->devfn;
>>> +                bus = parent_bus;
>>> +            }
>>> +        }
>>> +
>>> +        iommu_bus = parent_bus;
>>>      }
>>>      if (iommu_bus && iommu_bus->iommu_fn) {
>>> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>>> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);  
>> I think it would make sense to comment this iommu_fn() callback's role
>> somewhere.
> 
> While I agree, it seems a bit out of scope of this patch.  Or are you
> suggesting that this patch fundamentally changes the role rather than
> trying to make it work as intended?

The last question is what I tried to figure out when reviewing the patch
as you change the @bus and @devfn arg values passed to the cb. Given the
lack of documentation about the role of this function, it is not obvious
whether the patch breaks anything without reading the cb
implementations.

Thanks

Eric

> 
> Alex
> 

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
On Wed, 27 Mar 2019 19:07:44 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 3/27/19 7:02 PM, Alex Williamson wrote:
> > On Wed, 27 Mar 2019 12:46:41 +0100
> > Auger Eric <eric.auger@redhat.com> wrote:
> >   
> >> Hi Alex,
> >>
> >> On 3/26/19 11:55 PM, Alex Williamson wrote:  
> >>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >>> distinguish by devfn & bus between devices in a conventional PCI
> >>> topology and therefore we cannot assign them separate AddressSpaces.
> >>> By taking this requester ID aliasing into account, QEMU better matches
> >>> the bare metal behavior and restrictions, and enables shared
> >>> AddressSpace configurations that are otherwise not possible with
> >>> guest IOMMU support.
> >>>
> >>> For the latter case, given any example where an IOMMU group on the
> >>> host includes multiple devices:
> >>>
> >>>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> >>>
> >>> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >>> to assigning only one of the endpoints to the guest because a
> >>> second endpoint will attempt to use a different AddressSpace.  VFIO
> >>> only supports IOMMU group level granularity at the container level,
> >>> preventing this second endpoint from being assigned:
> >>>
> >>> qemu-system-x86_64 -machine q35... \
> >>>   -device intel-iommu,intremap=on \
> >>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>>
> >>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> >>> 0000:01:00.1: group 1 used in multiple address spaces
> >>>
> >>> However, when QEMU incorporates proper aliasing, we can make use of a
> >>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >>> provides the downstream devices with the same AddressSpace, ex:
> >>>
> >>> qemu-system-x86_64 -machine q35... \
> >>>   -device intel-iommu,intremap=on \
> >>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> >>>
> >>> While the utility of this hack may be limited, this AddressSpace
> >>> aliasing is the correct behavior for QEMU to emulate bare metal.
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >>>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>> index 35451c1e9987..38467e676f1f 100644
> >>> --- a/hw/pci/pci.c
> >>> +++ b/hw/pci/pci.c
> >>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>>  {
> >>>      PCIBus *bus = pci_get_bus(dev);
> >>>      PCIBus *iommu_bus = bus;
> >>> +    uint8_t devfn = dev->devfn;
> >>>  
> >>>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >>> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >>> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >>> +
> >>> +        /*
> >>> +         * Determine which requester ID alias should be used for the device
> >>> +         * based on the PCI topology.  There are no requester IDs on convetional    
> >> conventional  
> > 
> > Oops, fixed
> >   
> >>> +         * PCI buses, therefore we push the alias up to the parent on each non-
> >>> +         * express bus.  Which alias we use depends on whether this is a legacy
> >>> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> >>> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> >>> +         * because the resulting BDF depends on the secondary bridge register    
> >> Didn't you mean secondary bus number register?  
> > 
> > Yes, fixed
> >   
> >>> +         * programming.  We also cannot lookup the PCIBus from the bus number
> >>> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> >>> +         * alias to the root bus, which is usually, but not necessarily always
> >>> +         * where we'll find our iommu_fn.
> >>> +         */
> >>> +        if (!pci_bus_is_express(iommu_bus)) {
> >>> +            PCIDevice *parent = iommu_bus->parent_dev;
> >>> +
> >>> +            if (pci_is_express(parent) &&
> >>> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >>> +                devfn = PCI_DEVFN(0, 0);
> >>> +                bus = iommu_bus;
> >>> +            } else {
> >>> +                devfn = parent->devfn;
> >>> +                bus = parent_bus;
> >>> +            }
> >>> +        }
> >>> +
> >>> +        iommu_bus = parent_bus;
> >>>      }
> >>>      if (iommu_bus && iommu_bus->iommu_fn) {
> >>> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> >>> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);    
> >> I think it would make sense to comment this iommu_fn() callback's role
> >> somewhere.  
> > 
> > While I agree, it seems a bit out of scope of this patch.  Or are you
> > suggesting that this patch fundamentally changes the role rather than
> > trying to make it work as intended?  
> 
> The last question is what I tried to figure out when reviewing the patch
> as you change the @bus and @devfn arg values passed to the cb. Given the
> lack of documentation about the role of this function, it is not obvious
> whether the patch breaks anything without reading the cb
> implementations.

AIUI, the original semantics are simply "return an AddressSpace for this
{PCIBus, devfn}".  I don't think that fundamentally changes, it's just
that we know the bus hosting the IOMMU (ie. the one with iommu_fn) and
we know the common algorithm for determining the effective requester ID
the IOMMU will see for that device from its bus.  We're therefore
simply providing the topology defined alias to the device rather than
the device itself.  We're not actually changing the semantics of the
iommu_fn, we're just limiting the set of {PCIBus, devfn}s that we'll
request.  It seems like more of a semantic change to make the callee
responsible for determining the visibility of the requested device.
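
As a rough sketch of that contract (the PCIIOMMUFunc typedef matches
include/hw/pci/pci.h of this era; the callback body is purely illustrative,
with MyIOMMUState and my_lookup_as() as hypothetical names):

    /* The contract: return an AddressSpace for this {PCIBus, devfn}. */
    typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);

    /* Hypothetical vIOMMU callback: keep one AddressSpace per {bus, devfn}
     * key, so devices that alias to the same {bus, devfn} share one. */
    static AddressSpace *my_iommu_fn(PCIBus *bus, void *opaque, int devfn)
    {
        MyIOMMUState *s = opaque;           /* hypothetical IOMMU state */
        return my_lookup_as(s, bus, devfn); /* lookup keyed on the alias */
    }
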
Thanks,

Alex

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 5 years ago
On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> distinguish by devfn & bus between devices in a conventional PCI
> topology and therefore we cannot assign them separate AddressSpaces.
> By taking this requester ID aliasing into account, QEMU better matches
> the bare metal behavior and restrictions, and enables shared
> AddressSpace configurations that are otherwise not possible with
> guest IOMMU support.
> 
> For the latter case, given any example where an IOMMU group on the
> host includes multiple devices:
> 
>   $ ls  /sys/kernel/iommu_groups/1/devices/
>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> 
> If we incorporate a vIOMMU into the VM configuration, we're restricted
> to assigning only one of the endpoints to the guest because a
> second endpoint will attempt to use a different AddressSpace.  VFIO
> only supports IOMMU group level granularity at the container level,
> preventing this second endpoint from being assigned:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> 
> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> 0000:01:00.1: group 1 used in multiple address spaces
> 
> However, when QEMU incorporates proper aliasing, we can make use of a
> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> provides the downstream devices with the same AddressSpace, ex:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1


This will break extended address space access though, won't it?
Things like attempts to take PCI Express bandwidth into
consideration by drivers for E2E credit math will also
not work ...

> 
> While the utility of this hack may be limited, this AddressSpace
> aliasing is the correct behavior for QEMU to emulate bare metal.


This one's a valid point.

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 35451c1e9987..38467e676f1f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> +    uint8_t devfn = dev->devfn;
>  
>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> +
> +        /*
> +         * Determine which requester ID alias should be used for the device
> +         * based on the PCI topology.  There are no requester IDs on convetional
> +         * PCI buses, therefore we push the alias up to the parent on each non-
> +         * express bus.  Which alias we use depends on whether this is a legacy
> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> +         * because the resulting BDF depends on the secondary bridge register
> +         * programming.

Could you clarify this part a bit pls?

>  We also cannot lookup the PCIBus from the bus number
> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> +         * alias to the root bus, which is usually, but not necessarily always
> +         * where we'll find our iommu_fn.
> +         */
> +        if (!pci_bus_is_express(iommu_bus)) {
> +            PCIDevice *parent = iommu_bus->parent_dev;
> +
> +            if (pci_is_express(parent) &&
> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                devfn = PCI_DEVFN(0, 0);

Why 0,0?

> +                bus = iommu_bus;
> +            } else {
> +                devfn = parent->devfn;
> +                bus = parent_bus;
> +            }
> +        }
> +
> +        iommu_bus = parent_bus;
>      }
>      if (iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>      }
>      return &address_space_memory;
>  }

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Auger Eric 5 years ago
Hi,

On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>> distinguish by devfn & bus between devices in a conventional PCI
>> topology and therefore we cannot assign them separate AddressSpaces.
>> By taking this requester ID aliasing into account, QEMU better matches
>> the bare metal behavior and restrictions, and enables shared
>> AddressSpace configurations that are otherwise not possible with
>> guest IOMMU support.
>>
>> For the latter case, given any example where an IOMMU group on the
>> host includes multiple devices:
>>
>>   $ ls  /sys/kernel/iommu_groups/1/devices/
>>   0000:00:01.0  0000:01:00.0  0000:01:00.1
>>
>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>> to assigning only one of the endpoints to the guest because a
>> second endpoint will attempt to use a different AddressSpace.  VFIO
>> only supports IOMMU group level granularity at the container level,
>> preventing this second endpoint from being assigned:
>>
>> qemu-system-x86_64 -machine q35... \
>>   -device intel-iommu,intremap=on \
>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>
>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>> 0000:01:00.1: group 1 used in multiple address spaces
>>
>> However, when QEMU incorporates proper aliasing, we can make use of a
>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>> provides the downstream devices with the same AddressSpace, ex:
>>
>> qemu-system-x86_64 -machine q35... \
>>   -device intel-iommu,intremap=on \
>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> 
> 
> This will break extended address space access though, won't it?
> Things like attempts to take PCI Express bandwidth into
> consideration by drivers for E2E credit math will also
> not work ...
> 
>>
>> While the utility of this hack may be limited, this AddressSpace
>> aliasing is the correct behavior for QEMU to emulate bare metal.
> 
> 
> This one's a valid point.
> 
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 35451c1e9987..38467e676f1f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>  {
>>      PCIBus *bus = pci_get_bus(dev);
>>      PCIBus *iommu_bus = bus;
>> +    uint8_t devfn = dev->devfn;
>>  
>>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
>> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>> +
>> +        /*
>> +         * Determine which requester ID alias should be used for the device
>> +         * based on the PCI topology.  There are no requester IDs on convetional
>> +         * PCI buses, therefore we push the alias up to the parent on each non-
>> +         * express bus.  Which alias we use depends on whether this is a legacy
>> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
>> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
>> +         * because the resulting BDF depends on the secondary bridge register
>> +         * programming.
> 
> Could you clarify this part a bit pls?
> 
>>  We also cannot lookup the PCIBus from the bus number
>> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
>> +         * alias to the root bus, which is usually, but not necessarily always
>> +         * where we'll find our iommu_fn.
>> +         */
>> +        if (!pci_bus_is_express(iommu_bus)) {
>> +            PCIDevice *parent = iommu_bus->parent_dev;
>> +
>> +            if (pci_is_express(parent) &&
>> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
>> +                devfn = PCI_DEVFN(0, 0);
> 
> Why 0,0?

The spec says: "If the bridge generates a new Requester ID for a
transaction forwarded from the secondary interface to the primary
interface, the bridge assigns the PCI Express Requester ID using its
secondary interface’s Bus Number and sets both the Device Number and
Function Number fields to zero.". I guess that is the reason?

> 
>> +                bus = iommu_bus;
>> +            } else {
>> +                devfn = parent->devfn;
>> +                bus = parent_bus;
Alex, could you please clarify this case as well, perhaps with a comment?

Thanks

Eric
>> +            }
>> +        }
>> +
>> +        iommu_bus = parent_bus;
>>      }
>>      if (iommu_bus && iommu_bus->iommu_fn) {
>> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>      }
>>      return &address_space_memory;
>>  }

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 5 years ago
On Wed, Mar 27, 2019 at 05:43:45PM +0100, Auger Eric wrote:
> Hi,
> 
> On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >> distinguish by devfn & bus between devices in a conventional PCI
> >> topology and therefore we cannot assign them separate AddressSpaces.
> >> By taking this requester ID aliasing into account, QEMU better matches
> >> the bare metal behavior and restrictions, and enables shared
> >> AddressSpace configurations that are otherwise not possible with
> >> guest IOMMU support.
> >>
> >> For the latter case, given any example where an IOMMU group on the
> >> host includes multiple devices:
> >>
> >>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> >>
> >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >> to assigning only one of the endpoints to the guest because a
> >> second endpoint will attempt to use a different AddressSpace.  VFIO
> >> only supports IOMMU group level granularity at the container level,
> >> preventing this second endpoint from being assigned:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> >> 0000:01:00.1: group 1 used in multiple address spaces
> >>
> >> However, when QEMU incorporates proper aliasing, we can make use of a
> >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >> provides the downstream devices with the same AddressSpace, ex:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > 
> > This will break extended address space access though, won't it?
> > Things like attempts to take PCI Express bandwidth into
> > consideration by drivers for E2E credit math will also
> > not work ...
> > 
> >>
> >> While the utility of this hack may be limited, this AddressSpace
> >> aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > 
> > This one's a valid point.
> > 
> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >> ---
> >>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 35451c1e9987..38467e676f1f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>  {
> >>      PCIBus *bus = pci_get_bus(dev);
> >>      PCIBus *iommu_bus = bus;
> >> +    uint8_t devfn = dev->devfn;
> >>  
> >>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +
> >> +        /*
> >> +         * Determine which requester ID alias should be used for the device
> >> +         * based on the PCI topology.  There are no requester IDs on convetional
> >> +         * PCI buses, therefore we push the alias up to the parent on each non-
> >> +         * express bus.  Which alias we use depends on whether this is a legacy
> >> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> >> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> >> +         * because the resulting BDF depends on the secondary bridge register
> >> +         * programming.
> > 
> > Could you clarify this part a bit pls?
> > 
> >>  We also cannot lookup the PCIBus from the bus number
> >> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> >> +         * alias to the root bus, which is usually, but not necessarily always
> >> +         * where we'll find our iommu_fn.
> >> +         */
> >> +        if (!pci_bus_is_express(iommu_bus)) {
> >> +            PCIDevice *parent = iommu_bus->parent_dev;
> >> +
> >> +            if (pci_is_express(parent) &&
> >> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >> +                devfn = PCI_DEVFN(0, 0);
> > 
> > Why 0,0?
> 
> The spec says: "If the bridge generates a new Requester ID for a
> transaction forwarded from the secondary interface to the primary
> interface, the bridge assigns the PCI Express Requester ID using its
> secondary interface’s Bus Number and sets both the Device Number and
> Function Number fields to zero.". I guess that is the reason?

Ah.

> > 
> >> +                bus = iommu_bus;
> >> +            } else {
> >> +                devfn = parent->devfn;
> >> +                bus = parent_bus;
> Alex, could you please clarify this case as well, perhaps with a comment?
> 
> Thanks
> 
> Eric


right.

> >> +            }
> >> +        }
> >> +
> >> +        iommu_bus = parent_bus;
> >>      }
> >>      if (iommu_bus && iommu_bus->iommu_fn) {
> >> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> >> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >>      }
> >>      return &address_space_memory;
> >>  }

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
On Wed, 27 Mar 2019 17:43:45 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi,
> 
> On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:  
> >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >> distinguish by devfn & bus between devices in a conventional PCI
> >> topology and therefore we cannot assign them separate AddressSpaces.
> >> By taking this requester ID aliasing into account, QEMU better matches
> >> the bare metal behavior and restrictions, and enables shared
> >> AddressSpace configurations that are otherwise not possible with
> >> guest IOMMU support.
> >>
> >> For the latter case, given any example where an IOMMU group on the
> >> host includes multiple devices:
> >>
> >>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> >>
> >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >> to assigning only one of the endpoints to the guest because a
> >> second endpoint will attempt to use a different AddressSpace.  VFIO
> >> only supports IOMMU group level granularity at the container level,
> >> preventing this second endpoint from being assigned:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> >> 0000:01:00.1: group 1 used in multiple address spaces
> >>
> >> However, when QEMU incorporates proper aliasing, we can make use of a
> >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >> provides the downstream devices with the same AddressSpace, ex:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1  
> > 
> > 
> > This will break extended address space access though, won't it?
> > Things like attempts to take PCI Express bandwidth into
> > consideration by drivers for E2E credit math will also
> > not work ...
> >   
> >>
> >> While the utility of this hack may be limited, this AddressSpace
> >> aliasing is the correct behavior for QEMU to emulate bare metal.  
> > 
> > 
> > This one's a valid point.
> >   
> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >> ---
> >>  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 35451c1e9987..38467e676f1f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>  {
> >>      PCIBus *bus = pci_get_bus(dev);
> >>      PCIBus *iommu_bus = bus;
> >> +    uint8_t devfn = dev->devfn;
> >>  
> >>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +
> >> +        /*
> >> +         * Determine which requester ID alias should be used for the device
> >> +         * based on the PCI topology.  There are no requester IDs on convetional
> >> +         * PCI buses, therefore we push the alias up to the parent on each non-
> >> +         * express bus.  Which alias we use depends on whether this is a legacy
> >> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> >> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> >> +         * because the resulting BDF depends on the secondary bridge register
> >> +         * programming.  
> > 
> > Could you clarify this part a bit pls?
> >   
> >>  We also cannot lookup the PCIBus from the bus number
> >> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> >> +         * alias to the root bus, which is usually, but not necessarily always
> >> +         * where we'll find our iommu_fn.
> >> +         */
> >> +        if (!pci_bus_is_express(iommu_bus)) {
> >> +            PCIDevice *parent = iommu_bus->parent_dev;
> >> +
> >> +            if (pci_is_express(parent) &&
> >> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >> +                devfn = PCI_DEVFN(0, 0);  
> > 
> > Why 0,0?  
> 
> The spec says: "If the bridge generates a new Requester ID for a
> transaction forwarded from the secondary interface to the primary
> interface, the bridge assigns the PCI Express Requester ID using its
> secondary interface’s Bus Number and sets both the Device Number and
> Function Number fields to zero.". I guess that is the reason?

Yup, PCI Express to PCI/PCI-X Bridge Spec, rev 1.0, sec 2.3.
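
Concretely, the sec 2.3 alias works out as below (PCI_DEVFN() is the
standard macro from include/hw/pci/pci.h; secondary_bus is just a
stand-in for the bridge's secondary bus number):

    /* PCI_DEVFN() packs a 5-bit slot and a 3-bit function into devfn. */
    #define PCI_DEVFN(slot, func)   ((((slot) & 0x1f) << 3) | ((func) & 0x07))

    /* Sec 2.3: requester ID = secondary bus number, device/function 0. */
    uint16_t rid = (secondary_bus << 8) | PCI_DEVFN(0, 0);   /* <bus>:00.0 */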
 
> >   
> >> +                bus = iommu_bus;
> >> +            } else {
> >> +                devfn = parent->devfn;
> >> +                bus = parent_bus;  
> Alex, could you please clarify this case as well, perhaps with a comment?

IIRC, this one is more of a "because that's how it's implemented"
situation.  pci_req_id_cache_get() also handles this scenario and
refers to it as legacy PCI and the same algorithm works here.  So long
as we're on a non-express bus, we continue to push the DMA alias device
up to the parent device.  If we end on a bridge that does not have a
PCIe capability to expose itself as a PCIe-to-PCI bridge, then it
probably uses itself as the requester ID.  This is not without
exception though, see drivers/pci/search.c:pci_for_each_dma_alias in
the kernel to see that there are some (few) quirked bridges that use
the PCIe-to-PCI algorithm.  In our case here it barely matters, it's
only a difference of whether the bridge itself is considered within the
same AddressSpace as the downstream device and we use the algorithm
that is more generally correct to match bare metal.  Ideally we'd not
see this case in QEMU, but IIRC it's possible to put a PCI-to-PCI
bridge on a PCIe bus to generate it.  Thanks,

Alex
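
A simplified sketch of that walk, loosely modeled on the kernel's
pci_for_each_dma_alias() rather than copied from it (bus_is_express(),
bridge_is_pcie_to_pci(), bus_of() and make_bdf() are illustrative helper
names):

    /* Push the DMA alias up toward the root at every conventional bus. */
    while (!bus_is_express(bus) && bus->parent_bridge) {
        PCIDevice *br = bus->parent_bridge;

        if (bridge_is_pcie_to_pci(br)) {
            /* PCIe-to-PCI/X bridge: alias is <secondary bus>:00.0 */
            alias = make_bdf(bus_num(bus), PCI_DEVFN(0, 0));
        } else {
            /* Legacy PCI bridge: it forwards requests as itself. */
            alias = make_bdf(bus_num(bus_of(br)), br->devfn);
        }
        bus = bus_of(br);   /* continue from the bridge's primary side */
    }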

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
On Wed, 27 Mar 2019 11:32:55 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > to assigning only one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1  
> 
> 
> This will break extended address space access though, won't it?
> Things like attempts to take PCI Express bandwidth into
> consideration by drivers for E2E credit math will also
> not work ...

Correct.  As I explain in my reply to Peter, we're forcing the IOMMU to
use a common address space for these devices, and the only hack we have
to do that is to introduce a conventional PCI bus.  PCIe-to-PCI bridges
are required to respond with an unsupported request to extended config
space on the downstream devices.  This is part of the loss of utility I
mention below.  Perhaps we can add a couple more layers of hacks if
those sorts of things are important.  PCI-X supports extended config
space, so (without digging through the spec) I assume a PCIe-to-PCIX
bridge could pass extended config space.  Additionally, if we had the
same bridge in reverse mode, for something like this:

[RC]-[PCIe-to-PCIX]-[PCIX-to-PCIe]-[PCIe endpoint]

Then we would have access to link and extended config space while
maintaining the single address space.  Hack^3

> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.  
> 
> 
> This one's a valid point.
> 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 35451c1e9987..38467e676f1f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >  {
> >      PCIBus *bus = pci_get_bus(dev);
> >      PCIBus *iommu_bus = bus;
> > +    uint8_t devfn = dev->devfn;
> >  
> >      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > +
> > +        /*
> > +         * Determine which requester ID alias should be used for the device
> > +         * based on the PCI topology.  There are no requester IDs on convetional
> > +         * PCI buses, therefore we push the alias up to the parent on each non-
> > +         * express bus.  Which alias we use depends on whether this is a legacy
> > +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> > +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> > +         * because the resulting BDF depends on the secondary bridge register
> > +         * programming.  
> 
> Could you clarify this part a bit pls?

We use the same algorithm here as we do to come up with
requester_id_cache on the PCIDevice (modulo stopping at the iommu_fn
bus), but pci_req_id_cache_extract() is only useful once the bridge
secondary bus number register is programmed.  Until then, the bus part
of the BDF is always zero, which means we can't even look up the PCIBus
from the bus number.  Since we're setting up the IOMMU AddressSpace
during device realize, we cannot presume any bus number programming.
NB. None of the IOMMUs actually seem to care about the PCIBus at this
point, the pointer is simply used as an opaque token for a hash for the
bus, but that also means there's really no good substitute for it
either.
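
A hedged illustration of the problem (PCI_SECONDARY_BUS is the standard
config-space offset; dev stands for the bridge's PCIDevice and the
snippet is illustrative only):

    /* At realize time the guest has not programmed the bridge's secondary
     * bus number register yet, so it still reads back as zero... */
    uint8_t sec_bus = dev->config[PCI_SECONDARY_BUS];  /* 0 until guest setup */

    /* ...so the computed alias would always be 00:00.0, which can neither
     * identify the topology nor be used to look up the PCIBus. */
    uint16_t rid = (sec_bus << 8) | PCI_DEVFN(0, 0);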

> >  We also cannot lookup the PCIBus from the bus number
> > +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> > +         * alias to the root bus, which is usually, but not necessarily always
> > +         * where we'll find our iommu_fn.
> > +         */
> > +        if (!pci_bus_is_express(iommu_bus)) {
> > +            PCIDevice *parent = iommu_bus->parent_dev;
> > +
> > +            if (pci_is_express(parent) &&
> > +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                devfn = PCI_DEVFN(0, 0);  
> 
> Why 0,0?

See also pci_req_id_cache_get(), this is defined in the PCIe-to-PCI
bridge spec.  Bridges following this spec use a fixed requester ID for
all downstream devices in order to differentiate requests from the
bridge itself vs downstream devices.  pci_req_id_cache_extract() is
where this gets applied for the PCIReqIDCache version.  Thanks,

Alex

> > +                bus = iommu_bus;
> > +            } else {
> > +                devfn = parent->devfn;
> > +                bus = parent_bus;
> > +            }
> > +        }
> > +
> > +        iommu_bus = parent_bus;
> >      }
> >      if (iommu_bus && iommu_bus->iommu_fn) {
> > -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> > +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >      }
> >      return &address_space_memory;
> >  }  


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
[Cc +Brijesh]

Hi Brijesh, will the change below require the IVRS to be updated to
include aliases for all BDF ranges behind a conventional bridge?  I
think the Linux code handles this regardless of the firmware provided
aliases, but is it required per spec for the ACPI tables to include
bridge aliases?  Thanks,

Alex

On Tue, 26 Mar 2019 16:55:19 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> [snip]
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 35451c1e9987..38467e676f1f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
> +    uint8_t devfn = dev->devfn;
>  
>      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> +
> +        /*
> +         * Determine which requester ID alias should be used for the device
> +         * based on the PCI topology.  There are no requester IDs on convetional
> +         * PCI buses, therefore we push the alias up to the parent on each non-
> +         * express bus.  Which alias we use depends on whether this is a legacy
> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> +         * because the resulting BDF depends on the secondary bridge register
> +         * programming.  We also cannot lookup the PCIBus from the bus number
> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> +         * alias to the root bus, which is usually, but not necessarily always
> +         * where we'll find our iommu_fn.
> +         */
> +        if (!pci_bus_is_express(iommu_bus)) {
> +            PCIDevice *parent = iommu_bus->parent_dev;
> +
> +            if (pci_is_express(parent) &&
> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> +                devfn = PCI_DEVFN(0, 0);
> +                bus = iommu_bus;
> +            } else {
> +                devfn = parent->devfn;
> +                bus = parent_bus;
> +            }
> +        }
> +
> +        iommu_bus = parent_bus;
>      }
>      if (iommu_bus && iommu_bus->iommu_fn) {
> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>      }
>      return &address_space_memory;
>  }
> 


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Singh, Brijesh 5 years ago
Thanks for adding, Alex.

Adding Suravee.


On 3/29/19 11:49 AM, Alex Williamson wrote:
> [Cc +Brijesh]
> 
> Hi Brijesh, will the change below require the IVRS to be updated to
> include aliases for all BDF ranges behind a conventional bridge?  I
> think the Linux code handles this regardless of the firmware provided
> aliases, but is it required per spec for the ACPI tables to include
> bridge aliases?  Thanks,
> 

We do need to include aliases in the ACPI table. We need to populate
IVHD type 0x43 and 0x4 entries for the alias range start and end. I
believe the host IVRS would contain similar information.

Suravee, please correct me if I am missing something?

thanks

[snip]
Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 4 years, 8 months ago
On Mon, 1 Apr 2019 13:41:39 +0000
"Singh, Brijesh" <brijesh.singh@amd.com> wrote:

> [snip]
> 
> We do need to include aliases in the ACPI table. We need to populate
> IVHD type 0x43 and 0x4 entries for the alias range start and end. I
> believe the host IVRS would contain similar information.
> 
> Suravee, please correct me if I am missing something?

I finally found some time to investigate this a little further; yes,
the types mentioned are correct for defining the start and end of an
alias range.  The challenge here is that these entries require a
DeviceID,
which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
numbers are defined by the guest firmware, and potentially redefined by
the guest OS.  This makes it non-trivial to insert a few IVHDs into the
IVRS to describe alias ranges.  I'm wondering if the solution here is
to define a new linker-loader command that would instruct the guest to
write a bus number byte to a given offset for a described device.
These commands would be inserted before the checksum command, such that
these bus number updates are calculated as part of the checksum.

I'm imagining the command format would need to be able to distinguish
between the actual bus number of a described device, the secondary bus
number of the device, and the subordinate bus number of the device.
For describing the device, I'm envisioning stealing from the DMAR
definition, which already includes a bus-number-invariant mechanism to
describe a device: starting with a segment and root bus, follow a chain
of devfns to get to the target device.  Therefore the guest firmware
would follow the path to the described device, pick the desired bus
number, and write it to the indicated table offset.

Does this seem like a reasonable approach?  Better ideas?  I'm not
thrilled with the increased scope demanded by IVRS support, but so long
as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,

Alex
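
For illustration, a hypothetical shape for such a command, loosely
modeled on the existing entry layout in QEMU's
hw/acpi/bios-linker-loader.c; nothing below exists today, and every
name is invented for this sketch:

  enum {
      BUS_NUM_ACTUAL = 0,   /* bus number of the device itself */
      BUS_NUM_SECONDARY,    /* the bridge's secondary bus number */
      BUS_NUM_SUBORDINATE,  /* the bridge's subordinate bus number */
  };

  struct BiosLinkerLoaderWriteBusNum {
      char file[32];         /* fw_cfg blob (e.g. the IVRS) to patch */
      uint32_t offset;       /* where the bus number byte lands */
      uint8_t which;         /* one of the BUS_NUM_* selectors */
      uint16_t segment;      /* PCI segment of the root bus */
      uint8_t depth;         /* number of elements in devfn_path[] */
      uint8_t devfn_path[8]; /* chain of devfns from the root bus */
  } QEMU_PACKED;

  /* The firmware would walk devfn_path[] from the segment's root bus,
   * read the selected bus number from the device it reaches, and write
   * it at file:offset before processing the checksum command. */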

[snip]


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 4 years, 8 months ago
On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> [snip]
> 
> I finally found some time to investigate this a little further; yes,
> the types mentioned are correct for defining the start and end of an
> alias range.  The challenge here is that these entries require a
> DeviceID,
> which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> numbers are defined by the guest firmware, and potentially redefined by
> the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> IVRS to describe alias ranges.  I'm wondering if the solution here is
> to define a new linker-loader command that would instruct the guest to
> write a bus number byte to a given offset for a described device.
> These commands would be inserted before the checksum command, such that
> these bus number updates are calculated as part of the checksum.
> 
> I'm imagining the command format would need to be able to distinguish
> between the actual bus number of a described device, the secondary bus
> number of the device, and the subordinate bus number of the device.
> For describing the device, I'm envisioning stealing from the DMAR
> definition, which already includes a bus-number-invariant mechanism to
> describe a device: starting with a segment and root bus, follow a chain
> of devfns to get to the target device.  Therefore the guest firmware
> would follow the path to the described device, pick the desired bus
> number, and write it to the indicated table offset.
> 
> Does this seem like a reasonable approach?  Better ideas?  I'm not
> thrilled with the increased scope demanded by IVRS support, but so long
> as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,
> 
> Alex

We generally just took the bus #s as programmed into the virtual
bridges/devices and had QEMU program them into the ACPI tables.

If the guest OS changes bus #s, it's responsible for updating the
ACPI tables :)
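
For context, that pattern at table build time looks roughly like the
sketch below; pci_bus_num(), pci_get_bus(), PCI_BUILD_BDF and
build_append_int_noprefix() are existing QEMU helpers, while the
surrounding variables (dev, table_data) are assumed:

  /* Emit a DeviceID using whatever bus number the firmware has
   * programmed into the virtual bridge by the time the table is
   * built. */
  uint16_t devid = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)),
                                 dev->devfn);
  build_append_int_noprefix(table_data, devid, 2);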


[snip]

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 4 years, 8 months ago
On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > [Cc +Brijesh]
> > > 
> > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > think the Linux code handles this regardless of the firmware provided
> > > aliases, but is it required per spec for the ACPI tables to include
> > > bridge aliases?  Thanks,
> > >   
> > 
> > We do need to includes aliases in ACPI table. We need to populate the
> > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > IVRS would contain similar information.
> > 
> > Suravee, please correct me if I am missing something?
> 
> I finally found some time to investigate this a little further, yes the
> types mentioned are correct for defining start and end of an alias
> range.  The challenge here is that these entries require a DeviceID,
> which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> numbers are defined by the guest firmware, and potentially redefined by
> the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> IVRS to describe alias ranges.  I'm wondering if the solution here is
> to define a new linker-loader command that would instruct the guest to
> write a bus number byte to a given offset for a described device.
> These commands would be inserted before the checksum command, such that
> these bus number updates are calculated as part of the checksum.
> 
> I'm imagining the command format would need to be able to distinguish
> between the actual bus number of a described device, the secondary bus
> number of the device, and the subordinate bus number of the device.
> For describing the device, I'm envisioning stealing from the DMAR
> definition, which already includes a bus number invariant mechanism to
> describe a device, starting with a segment and root bus, follow a chain
> of devfns to get to the target device.  Therefore the guest firmware
> would follow the path to the described device, pick the desired bus
> number, and write it to the indicated table offset.
> 
> Does this seem like a reasonable approach?  Better ideas?  I'm not
> thrilled with the increased scope demanded by IVRS support, but so long
> as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,

I don't have a better idea yet, but I just want to say that,
coincidentally, I started looking into this as well this week, and
that's mostly what I had thought about too (I was still reading a bit
of SeaBIOS when I saw this email)... so at least this idea makes sense
to me.

Would the guest OS still change the PCI bus numbers even after the
firmware (BIOS/UEFI) has assigned them?  Could I ask in what case that
would happen?

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 4 years, 8 months ago
On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
> [snip]
> 
> Would the guest OS still change the PCI bus numbers even after the
> firmware (BIOS/UEFI) has assigned them?  Could I ask in what case that
> would happen?
> 
> Thanks,

Guest OSes can in theory rebalance resources.  Changing bus numbers
would be useful if new bridges are added by hotplug.
In practice, at least Linux doesn't do the rebalancing.
I think that if we start reporting PNP OS support in the BIOS then
Windows might start doing that more aggressively.


> -- 
> Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 4 years, 8 months ago
On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
> > [snip]
> 
> Guest OSes can in theory rebalance resources.  Changing bus numbers
> would be useful if new bridges are added by hotplug.
> In practice, at least Linux doesn't do the rebalancing.
> I think that if we start reporting PNP OS support in the BIOS then
> Windows might start doing that more aggressively.

That surprises me a bit...  IMHO, if we allow the bus numbers to
change, then many scripts that used to work could fail.  E.g., a very
common script might run an "lspci-like" program to list each device
and then run "lspci-like -vvv" again on a BDF it fetched from the
previous command.  Any kind of BDF caching, whether in userspace or in
the kernel, would become invalid.

Also, the data stored in the IVRS is obviously closely bound to how
the bus numbers are defined.  Even if we can add a new linker-loader
command to all the open firmwares like SeaBIOS or OVMF, we still can't
do that for Windows (or could we?...).

Now, one step back: I'm also curious about why the AMD spec requires
BDF information in the IVRS, rather than scope information like what
the Intel DMAR spec asks for.

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 4 years, 8 months ago
On Wed, 24 Jul 2019 18:03:31 +0800
Peter Xu <zhexu@redhat.com> wrote:

> [snip]
> 
> Now, one step back: I'm also curious about why the AMD spec requires
> BDF information in the IVRS, rather than scope information like what
> the Intel DMAR spec asks for.

It's a deficiency of the IVRS spec, but it's really out of scope here.
It's not the responsibility of the hypervisor to resolve this sort of
design issue; we should simply maintain the bare metal behavior and
the bare metal limitations of the design.

Michael did raise some interesting ideas regarding QEMU updating the
IVRS table though.  QEMU does know when bus apertures are programmed
on devices, and the config writes for these updates could trigger IVRS
updates.  I think we'd want to allow such updates only between machine
reset and the guest firmware writing the table checksum.  This reduces
the scope of the necessary changes, though it still feels a little
messy to have these config writes making table updates.

Another approach, and maybe what Michael was really suggesting, is
that we essentially create the ACPI tables twice, AFAICT: once
initially, then again via a select callback in fw_cfg.  For SeaBIOS,
it looks like this second generation would be created after the PCI
bus has been enumerated and initialized.  I've been trying to see if
the same is likely for OVMF, though it's not clear to me that this is
a reasonable ordering to rely on.  It would be entirely reasonable for
firmware to process ACPI tables in advance of enumerating PCI, even
potentially as a prerequisite to enumerating PCI.  So ultimately I'm
not sure if there are valid ordering assumptions to use these
callbacks this way, though I'd appreciate any further discussion.
Thanks,

Alex
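
A sketch of the first idea (purely hypothetical: ranges_overlap(),
PCI_PRIMARY_BUS and PCI_SUBORDINATE_BUS are real QEMU/PCI definitions,
but neither the hook nor any of the ivrs_* names exist):

  static bool ivrs_checksum_finalized;         /* set once the firmware
                                                  writes the checksum */
  static void ivrs_refresh_alias_ranges(void); /* would regenerate the
                                                  IVHD 0x43/0x4 entries */

  static void ivrs_bridge_config_write(PCIDevice *bridge, uint32_t addr,
                                       uint32_t val, int len)
  {
      /* The three bus number bytes live at config offsets
       * PCI_PRIMARY_BUS..PCI_SUBORDINATE_BUS (0x18-0x1a). */
      if (ranges_overlap(addr, len, PCI_PRIMARY_BUS, 3) &&
          !ivrs_checksum_finalized) {
          ivrs_refresh_alias_ranges();
      }
  }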

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 4 years, 8 months ago
On Wed, 24 Jul 2019 08:43:55 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:  
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:    
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:    
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:    
> > > > > > > [Cc +Brijesh]
> > > > > > > 
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases?  Thanks,
[snip]

For a data point, I fired up an old 990FX system which includes a
PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep
things interesting... guess how many alias ranges are in the IVRS...
Yep, just the one built into the motherboard:

AMD-Vi: Using IVHD type 0x10
AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
AMD-Vi:        mmio-addr: 00000000fec30000
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:00.2
AMD-Vi:   DEV_SELECT			 devid: 00:09.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 01:00.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:0a.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 02:00.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:11.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:12.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:12.2
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:13.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:13.2
AMD-Vi:   DEV_SELECT			 devid: 00:14.0 flags: d7
AMD-Vi:   DEV_SELECT			 devid: 00:14.2 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:14.3 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:14.4 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE		 devid: 03:00.0 flags: 00 devid_to: 00:14.4
AMD-Vi:   DEV_RANGE_END		 devid: 03:1f.7

[Everything on bus 03:xx.x is aliased to device 00:14.4, the built-in PCIe-to-PCI bridge]

AMD-Vi:   DEV_SELECT			 devid: 00:14.5 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:15.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 04:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 04:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.1 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 05:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 05:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.2 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 06:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 06:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.3 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 08:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 08:1f.7
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:16.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:16.2
AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:14.0
AMD-Vi:   DEV_SPECIAL(HPET[0])		devid: 00:14.0
AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:00.1

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge
           +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU)
           +-09.0-[01]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-0a.0-[02]----00.0  Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller
           +-11.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
           +-12.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-12.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-13.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-13.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller
           +-14.2  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA)
           +-14.3  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller
           +-14.4-[03]--+-06.0  NVidia / SGS Thomson (Joint Venture) Riva128
           |            \-0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
           +-14.5  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
           +-15.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-15.1-[05]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-15.2-[06-07]----00.0-[07]----00.0  Realtek Semiconductor Co., Ltd. Device 8149
           +-15.3-[08]--
           +-16.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-16.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control
           \-18.4  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control

00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode])
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=64

06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode])
	Bus: primary=06, secondary=07, subordinate=07, sec-latency=32
	Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00

I realized as I was writing that the alias caused by 06:00.0 would be
devfn 00.0 on the secondary bus 07, i.e. 07:00.0 would alias to
07:00.0, so technically there's really not an alias for this small
case.  So I replaced the NIC with this:

           +-15.2-[06-07]----00.0-[07]--+-00.0  NEC Corporation OHCI USB Controller
                                        +-00.1  NEC Corporation OHCI USB Controller
                                        \-00.2  NEC Corporation uPD72010x USB 2.0 Controller

Now functions 07:00.[12] also alias to 07:00.0.  The IVRS table is
unaffected.

I'm tempted to say that QEMU would be no worse than bare metal if we
simply ignore IVHD device alias entries.  I know that Linux will figure
out the aliasing regardless of the IVRS.  Will Windows guests?  I'd
love to hear from Brijesh or Suravee regarding the behavior above versus
what AMD expects from system vendors.  Thanks,

Alex
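
For reference, the arithmetic behind the corner case above
(illustrative only):

  /* The ASM1083 at 06:00.0 has secondary bus 7, so every function
   * behind it aliases to sec_bus:00.0: */
  uint16_t alias = PCI_BUILD_BDF(7, PCI_DEVFN(0, 0)); /* 0x0700, 07:00.0 */
  /* 07:00.0 thus aliases to itself; the added 07:00.1 and 07:00.2
   * fold into it, which the host IVRS still does not describe. */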

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Singh, Brijesh 4 years, 8 months ago

On 7/24/19 2:42 PM, Alex Williamson wrote:
> [snip]
> 
> I'm tempted to say that QEMU would be no worse than bare metal if we
> simply ignore IVHD device alias entries.  I know that Linux will figure
> out the aliasing regardless of the IVRS.  Will Windows guests?  I'd
> love to hear from Bijesh or Suravee regarding the behavior above versus
> what AMD expects from system vendors.  Thanks,
> 

Alex,

I am not well versed in the expected IOMMU address space behavior yet;
Suravee will know more about it. I will ask him to take a look at this
thread.

thanks

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 4 years, 8 months ago
On Wed, Jul 24, 2019 at 08:43:55AM -0600, Alex Williamson wrote:
> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:  
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:  
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:  
> > > > > > > [Cc +Brijesh]
> > > > > > > 
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases?  Thanks,
> > > > > > >     
> > > > > > 
> > > > > > We do need to includes aliases in ACPI table. We need to populate the
> > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > > > > IVRS would contain similar information.
> > > > > > 
> > > > > > Suravee, please correct me if I am missing something?  
> > > > > 
> > > > > I finally found some time to investigate this a little further, yes the
> > > > > types mentioned are correct for defining start and end of an alias
> > > > > range.  The challenge here is that these entries require a DeviceID,
> > > > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > > > numbers are defined by the guest firmware, and potentially redefined by
> > > > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > > > to define a new linker-loader command that would instruct the guest to
> > > > > write a bus number byte to a given offset for a described device.
> > > > > These commands would be inserted before the checksum command, such that
> > > > > these bus number updates are calculated as part of the checksum.
> > > > > 
> > > > > I'm imagining the command format would need to be able to distinguish
> > > > > between the actual bus number of a described device, the secondary bus
> > > > > number of the device, and the subordinate bus number of the device.
> > > > > For describing the device, I'm envisioning stealing from the DMAR
> > > > > definition, which already includes a bus-number-invariant mechanism to
> > > > > describe a device: starting with a segment and root bus, follow a chain
> > > > > of devfns to get to the target device.  Therefore the guest firmware
> > > > > would follow the path to the described device, pick the desired bus
> > > > > number, and write it to the indicated table offset.
> > > > > 
> > > > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > > > thrilled with the increased scope demanded by IVRS support, but so long
> > > > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,  
> > > > 
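To make the shape of such a command concrete, one possible payload
layout is sketched below.  This is not an existing QEMU interface and
every name in it is hypothetical:

    /*
     * Hypothetical linker-loader command: ask the guest firmware to
     * write one bus number byte into a previously loaded table blob,
     * identifying the target device DMAR-style by a devfn walk from a
     * root bus.  Issued before the checksum command so the written
     * byte is covered by the checksum calculation.
     */
    struct BiosLinkerLoaderWriteBusNumber {
        char     table_file[56];  /* fw_cfg file holding the table */
        uint32_t offset;          /* where in the blob to write the byte */
        uint8_t  which;           /* 0 = device's own bus,
                                     1 = secondary, 2 = subordinate */
        uint16_t segment;         /* PCI segment of the starting root bus */
        uint8_t  start_bus;       /* root bus the devfn walk starts from */
        uint8_t  path_len;        /* number of devfn hops that follow */
        uint8_t  devfn_path[8];   /* bus-number-invariant path to device */
    } QEMU_PACKED;
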
> > > > I don't have a better idea yet, but I just want to say that,
> > > > coincidentally, I started looking into this as well this week, and
> > > > that's mostly what I thought about too (I was still reading a bit of
> > > > seabios when I saw this email)... so at least this idea makes sense to
> > > > me.
> > > > 
> > > > Would the guest OS still change the PCI bus numbers even after the
> > > > firmware (BIOS/UEFI) has assigned them?  Could I ask in what case that
> > > > would happen?
> > > > 
> > > > Thanks,  
> > > 
> > > Guest OSes can in theory rebalance resources. Changing bus numbers
> > > would be useful if new bridges are added by hotplug.
> > > In practice at least Linux doesn't do the rebalancing.
> > > I think that if we start reporting PNP OS support in the BIOS then
> > > Windows might start doing that more aggressively.  
> > 
> > It surprises me a bit...  IMHO if we allow the bus number to change,
> > then many scripts which used to work can fail.  E.g., a very common
> > script might run an "lspci-like" program to list each device and then
> > run "lspci-like -vvv" again on the BDFs it fetched from the previous
> > command.  Any kind of BDF caching, whether in userspace or the kernel,
> > would become invalid.
> > 
> > Also, obviously the data to be stored in the IVRS is closely bound to
> > how the bus numbers are assigned.  Even if we can add a new
> > linker-loader command to all the open firmwares like seabios or OVMF,
> > we still can't do that for Windows (or, could we?...).
> > 
> > Taking one step back, I'm also curious about the reason why the AMD
> > spec requires the IVRS to carry BDF information, rather than scope
> > information like what the Intel DMAR spec asks for.
> 
> It's a deficiency of the IVRS spec, but it's really out of scope here.
> It's not the responsibility of the hypervisor to resolve this sort of
> design issue; we should simply maintain the bare metal behavior and the
> bare metal limitations of the design.

Yes, this is a better perspective.  It's not the first time I've
forgotten to go back and use bare metal as the reference, given that's
what we're emulating and it normally has a very similar problem.  And
the "data point" email does provide proper supporting material for this.

> 
> Michael did raise some interesting ideas regarding QEMU updating the
> IVRS table though.  QEMU does know when bus apertures are programmed on
> devices, and the config writes for these updates could trigger IVRS
> updates.  I think we'd want to allow such updates only between machine
> reset and the guest firmware writing the table checksum.  This reduces
> the scope of the necessary changes, though it still feels a little messy
> to have these config writes making table updates.
> 
> Another approach, and maybe what Michael was really suggesting, is that
> we essentially create the ACPI tables twice AFAICT.  Once initially,
> then again via a select callback in fw_cfg.  For SeaBIOS, it looks like
> this second generation would be created after the PCI bus has been
> enumerated and initialized.  I've been trying to see if the same is
> likely for OVMF, though it's not clear to me that this is a reasonable
> ordering to rely on.  It would be entirely reasonable that firmware
> could process ACPI tables in advance of enumerating PCI, even
> potentially as a prerequisite to enumerating PCI.  So ultimately I'm not
> sure if there are valid ordering assumptions to use these callbacks
> this way, though I'd appreciate any further discussion.  Thanks,
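
For context, QEMU's x86 ACPI code already regenerates the table blobs
from a fw_cfg callback when the firmware selects them, roughly as
below; this is simplified from hw/i386/acpi-build.c of that era and
details may differ:

    static void acpi_build_update(void *build_opaque)
    {
        AcpiBuildState *build_state = build_opaque;
        AcpiBuildTables tables;

        /* No state to update or already patched?  Nothing to do. */
        if (!build_state || build_state->patched) {
            return;
        }
        build_state->patched = 1;

        acpi_build_tables_init(&tables);
        acpi_build(&tables, MACHINE(qdev_get_machine()));

        acpi_ram_update(build_state->table_mr, tables.table_data);
        acpi_ram_update(build_state->rsdp_mr, tables.rsdp);
        acpi_ram_update(build_state->linker_mr, tables.linker->cmd_blob);

        acpi_build_tables_cleanup(&tables, true);
    }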

After re-reading Michael's reply, I feel like what Michael suggested is
that we can simply ignore the bus-number-change case by the guest OS
for now, but I might be wrong.  In any case, this will be a problem
only if "we still need to fill in the BDF information" somehow, and
that statement now seems questionable.

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 4 years, 8 months ago
On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> After re-reading Michael's reply, I feel like what Michael suggested is
> that we can simply ignore the bus-number-change case by the guest OS
> for now, but I might be wrong.
That's what I suggested, yes.

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 4 years, 8 months ago
On Thu, 25 Jul 2019 06:43:04 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > After re-reading Michael's reply, I feel like what Michael suggested is
> > that we can simply ignore the bus-number-change case by the guest OS
> > for now, but I might be wrong.  
> That's what I suggested, yes.

"by the guest OS", yes, that's the part that's beyond the scope of this
effort.  If we deem it necessary to support IVHD DMA alias though, it's
the guest firmware that determines the initial bus numbers, which is
part of the DeviceID used to defined IVHD entries and we would not be
able to ignore that initial programming.  Everything related to the
guest OS renumber PCI buses is out of scope, the guest firmware
programming initial bus numbers is not.  Thanks,

Alex

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 4 years, 8 months ago
On Thu, Jul 25, 2019 at 08:00:23AM -0600, Alex Williamson wrote:
> On Thu, 25 Jul 2019 06:43:04 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > > After re-reading Michael's reply, I feel like what Michael suggested is
> > > that we can simply ignore the bus-number-change case by the guest OS
> > > for now, but I might be wrong.  
> > That's what I suggested, yes.
> 
> "by the guest OS", yes, that's the part that's beyond the scope of this
> effort.  If we deem it necessary to support IVHD DMA alias though, it's
> the guest firmware that determines the initial bus numbers, which is
> part of the DeviceID used to defined IVHD entries and we would not be
> able to ignore that initial programming.  Everything related to the
> guest OS renumber PCI buses is out of scope, the guest firmware
> programming initial bus numbers is not.  

Right. That's par for the course; we have the same issues with MCFG
and others. The BIOS programs it and then we generate ACPI based on
that.

> Thanks,
> 
> Alex

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 5 years ago
On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> distinguish by devfn & bus between devices in a conventional PCI
> topology and therefore we cannot assign them separate AddressSpaces.
> By taking this requester ID aliasing into account, QEMU better matches
> the bare metal behavior and restrictions, and enables shared
> AddressSpace configurations that are otherwise not possible with
> guest IOMMU support.
> 
> For the latter case, given any example where an IOMMU group on the
> host includes multiple devices:
> 
>   $ ls  /sys/kernel/iommu_groups/1/devices/
>   0000:00:01.0  0000:01:00.0  0000:01:00.1

[1]

> 
> If we incorporate a vIOMMU into the VM configuration, we're restricted
> that we can only assign one of the endpoints to the guest because a
> second endpoint will attempt to use a different AddressSpace.  VFIO
> only supports IOMMU group level granularity at the container level,
> preventing this second endpoint from being assigned:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> 
> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> 0000:01:00.1: group 1 used in multiple address spaces
> 
> However, when QEMU incorporates proper aliasing, we can make use of a
> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> provides the downstream devices with the same AddressSpace, ex:
> 
> qemu-system-x86_64 -machine q35... \
>   -device intel-iommu,intremap=on \
>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> 
> While the utility of this hack may be limited, this AddressSpace
> aliasing is the correct behavior for QEMU to emulate bare metal.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

The patch looks sane to me even as a bug fix since otherwise the DMA
address spaces used under misc kinds of PCI bridges can be wrong, so:

Reviewed-by: Peter Xu <peterx@redhat.com>

Though I have a question that confused me even before: Alex, do you
know why all the context entries of the devices in the IOMMU root table
will be programmed even if the devices are under a pcie-to-pci bridge?
I'm giving an example with the above [1] to be clear: in that case IIUC
we'll program context entries for all three devices (00:01.0,
01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
bare metal and then why do we bother to program the context entry of
01:00.1?  It seems never used.

(It should be needed for current QEMU to work with pcie-to-pci bridges
 without this patch, but I feel like I don't know the real answer
 behind it)

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Michael S. Tsirkin 5 years ago
On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote:
> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> 
> [1]
> 
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> The patch looks sane to me even as a bug fix since otherwise the DMA
> address spaces used under misc kinds of PCI bridges can be wrong, so:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Though I have a question that confused me even before: Alex, do you
> know why all the context entries of the devices in the IOMMU root table
> will be programmed even if the devices are under a pcie-to-pci bridge?
> I'm giving an example with the above [1] to be clear: in that case IIUC
> we'll program context entries for all three devices (00:01.0,
> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> bare metal

What makes you think so?

PCI Express spec says:

Requester ID

The combination of a Requester's Bus Number, Device Number, and Function
Number that uniquely identifies the Requester. With an ARI Requester ID, bits
traditionally used for the Device Number field are used instead to expand the
Function Number field, and the Device Number is implied to be 0.
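
As a quick illustration of that definition (the macro names below are
made up for the example):

    /* Plain requester ID: bus[15:8] device[7:3] function[2:0].
     * With ARI, bits [7:0] are all function number; device is 0. */
    #define RID_BUS(rid)     (((rid) >> 8) & 0xff)
    #define RID_DEV(rid)     (((rid) >> 3) & 0x1f)
    #define RID_FN(rid)      ((rid) & 0x07)
    #define RID_ARI_FN(rid)  ((rid) & 0xff)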



> and then why do we bother to program the context entry of
> 01:00.1?  It seems never used.
> 
> (It should be needed for current QEMU to work with pcie-to-pci bridges
>  without this patch, but I feel like I don't know the real answer
>  behind it)
> 
> Thanks,
> 
> -- 
> Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 5 years ago
On Wed, Mar 27, 2019 at 11:35:35AM -0400, Michael S. Tsirkin wrote:
> On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > > distinguish by devfn & bus between devices in a conventional PCI
> > > topology and therefore we cannot assign them separate AddressSpaces.
> > > By taking this requester ID aliasing into account, QEMU better matches
> > > the bare metal behavior and restrictions, and enables shared
> > > AddressSpace configurations that are otherwise not possible with
> > > guest IOMMU support.
> > > 
> > > For the latter case, given any example where an IOMMU group on the
> > > host includes multiple devices:
> > > 
> > >   $ ls  /sys/kernel/iommu_groups/1/devices/
> > >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > [1]
> > 
> > > 
> > > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > > that we can only assign one of the endpoints to the guest because a
> > > second endpoint will attempt to use a different AddressSpace.  VFIO
> > > only supports IOMMU group level granularity at the container level,
> > > preventing this second endpoint from being assigned:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > > 
> > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > > 0000:01:00.1: group 1 used in multiple address spaces
> > > 
> > > However, when QEMU incorporates proper aliasing, we can make use of a
> > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > > provides the downstream devices with the same AddressSpace, ex:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > > 
> > > While the utility of this hack may be limited, this AddressSpace
> > > aliasing is the correct behavior for QEMU to emulate bare metal.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > The patch looks sane to me even as a bug fix since otherwise the DMA
> > address spaces used under misc kinds of PCI bridges can be wrong, so:
> > 
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > 
> > Though I have a question that confused me even before: Alex, do you
> > know why all the context entries of the devices in the IOMMU root table
> > will be programmed even if the devices are under a pcie-to-pci bridge?
> > I'm giving an example with the above [1] to be clear: in that case IIUC
> > we'll program context entries for all three devices (00:01.0,
> > 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> > devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> > bare metal
> 
> What makes you think so?
> 
> PCI Express spec says:
> 
> Requester ID
> 
> The combination of a Requester's Bus Number, Device Number, and Function
> Number that uniquely identifies the Requester. With an ARI Requester ID, bits
> traditionally used for the Device Number field are used instead to expand the
> Function Number field, and the Device Number is implied to be 0.

Things can be special when there are bridges like a pcie-to-pci
bridge, because the bridge can take ownership of the transaction
and modify the requester ID (which is part of the transaction
ID).  I believe it's clearer now since I saw a lot of discussion
happening about this in the other thread too...

Thanks,

-- 
Peter Xu

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Alex Williamson 5 years ago
On Wed, 27 Mar 2019 14:25:00 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1  
> 
> [1]
> 
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> The patch looks sane to me even as a bug fix since otherwise the DMA
> address spaces used under misc kinds of PCI bridges can be wrong, so:

I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
I'd be cautious about this if so.  Eric Auger noted that he's seen an
SMMU VM hit a guest kernel bug-on, which needs further
investigation.  It's not clear if it's just an untested or
unimplemented scenario for SMMU to see a conventional PCI bus or if
there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
only VT-d to a very limited degree, thus RFC.
 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Though I have a question that confused me even before: Alex, do you
> know why all the context entries of the devices in the IOMMU root table
> will be programmed even if the devices are under a pcie-to-pci bridge?
> I'm giving an example with the above [1] to be clear: in that case IIUC
> we'll program context entries for all three devices (00:01.0,
> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> bare metal and then why do we bother to program the context entry of
> 01:00.1?  It seems never used.
> 
> (It should be needed for current QEMU to work with pcie-to-pci bridges
>  without this patch, but I feel like I don't know the real answer
>  behind it)

We actually have two different scenarios that could be represented by
[1], the group can be formed by lack of isolation or by lack of
visibility.  In the group above, it's the former, isolation.  The PCIe
root port does not support ACS, so while the IOMMU has visibility of
the individual devices, peer-to-peer between devices may also be
possible.  Native, trusted, in-kernel drivers for these devices could
still make use of separate IOMMU domains per device, but in order to
expose the devices to a userspace driver we need to consider them a
non-isolated set to prevent side-channel attacks between devices.  We
therefore consider them as a group within the IOMMU API and it's
required that each context entry maps to the same domain as the IOMMU
will see transactions for each requester ID.
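
This can be seen in the Linux VT-d driver, which walks the DMA aliases
of a device and programs a context entry for each RID the IOMMU may
see; a sketch based on drivers/iommu/intel-iommu.c of that era (exact
signatures vary by kernel version):

    struct domain_context_mapping_data {
        struct dmar_domain *domain;
        struct intel_iommu *iommu;
    };

    static int domain_context_mapping_cb(struct pci_dev *pdev,
                                         u16 alias, void *opaque)
    {
        struct domain_context_mapping_data *data = opaque;

        /* One context entry per alias RID, all pointing at one domain */
        return domain_context_mapping_one(data->domain, data->iommu,
                                          PCI_BUS_NUM(alias),
                                          alias & 0xff);
    }

    /* ...invoked via pci_for_each_dma_alias(pdev,
     *                    &domain_context_mapping_cb, &data); */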

If we had the visibility case, such as if [1] represented a PCIe-to-PCI
bridge subgroup, then the IOMMU really does only see the bridge
requester ID and there may not be a reason to populate the context
entries for the downstream aliased devices.  Perhaps the IOMMU might
still choose to do so, particularly if the bridge is actually a PCI-X
bridge as PCI-X does incorporate a requester ID, but also has strange
rules about the bridge being able to claim ownership of the
transaction.  So it might be paranoia or simplification that causes all
the context entries to be programmed, or for alias quirks, uncertainty
if a device exclusively uses a quirked requester ID or might sometimes
use the proper requester ID.

In the example I present, we're taking [1], which could be either case
above, and converting it into the visibility case in order to force the
IOMMU to handle the devices within a single address space.  Thanks,

Alex

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Auger Eric 5 years ago
Hi Alex,

[+ Robin]

On 3/27/19 5:37 PM, Alex Williamson wrote:
> On Wed, 27 Mar 2019 14:25:00 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>> distinguish by devfn & bus between devices in a conventional PCI
>>> topology and therefore we cannot assign them separate AddressSpaces.
>>> By taking this requester ID aliasing into account, QEMU better matches
>>> the bare metal behavior and restrictions, and enables shared
>>> AddressSpace configurations that are otherwise not possible with
>>> guest IOMMU support.
>>>
>>> For the latter case, given any example where an IOMMU group on the
>>> host includes multiple devices:
>>>
>>>   $ ls  /sys/kernel/iommu_groups/1/devices/
>>>   0000:00:01.0  0000:01:00.0  0000:01:00.1  
>>
>> [1]
>>
>>>
>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>> that we can only assign one of the endpoints to the guest because a
>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>> only supports IOMMU group level granularity at the container level,
>>> preventing this second endpoint from being assigned:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>
>>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>> 0000:01:00.1: group 1 used in multiple address spaces
>>>
>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>> provides the downstream devices with the same AddressSpace, ex:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>
>>> While the utility of this hack may be limited, this AddressSpace
>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
>>
>> The patch looks sane to me even as a bug fix since otherwise the DMA
>> address spaces used under misc kinds of PCI bridges can be wrong, so:
> 
> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
> I'd be cautious about this if so.  Eric Auger noted that he's seen an
> SMMU VM hit a guest kernel bug-on, which needs further
> investigation.  It's not clear if it's just an untested or
> unimplemented scenario for SMMU to see a conventional PCI bus or if
> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
> only VT-d to a very limited degree, thus RFC.

So I have tracked this further and here is what I can see.

On the guest side, the 2 assigned devices that I have put downstream of
the pcie-to-pci bridge each get an iommu_fwspec handle with 2 ids, the
first one corresponding to the requester ID of the device itself and the
second one corresponding to the RID with the same bus number and devfn=0:

dev0 =  0000:02:01.0
        0000:02:00.0

dev1 = 0000:02:01.1
       0000:02:00.0

Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
Each time it iterates over the associated ids and we call add_device
twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
recognizes a context is already alive for 0000:02:00.0 and triggers a
BUG_ON().

As for the origin of the 2 ids created for each device:
iort_iommu_configure is called on each downstream device, which calls
pci_for_each_dma_alias().  We enter the
pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
iort_pci_iommu_init is called with bus number 2 and devfn=0.
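
For reference, pci_for_each_dma_alias() is the kernel helper that walks
exactly these alias RIDs; a minimal sketch of using it (the callback
below is illustrative only):

    #include <linux/pci.h>

    /* Log every RID the IOMMU may see for pdev, including the devfn=0
     * alias contributed by a PCIe-to-PCI bridge on the path. */
    static int dump_alias(struct pci_dev *pdev, u16 alias, void *data)
    {
        pr_info("%s: DMA alias RID %02x:%02x.%d\n", pci_name(pdev),
                PCI_BUS_NUM(alias), PCI_SLOT(alias & 0xff),
                PCI_FUNC(alias & 0xff));
        return 0;
    }

    /* ...then: pci_for_each_dma_alias(pdev, dump_alias, NULL); */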

Thanks

Eric
>  
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>
>> Though I have a question that confused me even before: Alex, do you
>> know why all the context entries of the devices in the IOMMU root table
>> will be programmed even if the devices are under a pcie-to-pci bridge?
>> I'm giving an example with the above [1] to be clear: in that case IIUC
>> we'll program context entries for all three devices (00:01.0,
>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
>> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
>> bare metal and then why do we bother to program the context entry of
>> 01:00.1?  It seems never used.
>>
>> (It should be needed for current QEMU to work with pcie-to-pci bridges
>>  without this patch, but I feel like I don't know the real answer
>>  behind it)
> 
> We actually have two different scenarios that could be represented by
> [1], the group can be formed by lack of isolation or by lack of
> visibility.  In the group above, it's the former, isolation.  The PCIe
> root port does not support ACS, so while the IOMMU has visibility of
> the individual devices, peer-to-peer between devices may also be
> possible.  Native, trusted, in-kernel drivers for these devices could
> still make use of separate IOMMU domains per device, but in order to
> expose the devices to a userspace driver we need to consider them a
> non-isolated set to prevent side-channel attacks between devices.  We
> therefore consider them as a group within the IOMMU API and it's
> required that each context entry maps to the same domain as the IOMMU
> will see transactions for each requester ID.
> 
> If we had the visibility case, such as if [1] represented a PCIe-to-PCI
> bridge subgroup, then the IOMMU really does only see the bridge
> requester ID and there may not be a reason to populate the context
> entries for the downstream aliased devices.  Perhaps the IOMMU might
> still choose to do so, particularly if the bridge is actually a PCI-X
> bridge as PCI-X does incorporate a requester ID, but also has strange
> rules about the bridge being able to claim ownership of the
> transaction.  So it might be paranoia or simplification that causes all
> the context entries to be programmed, or for alias quirks, uncertainty
> if a device exclusively uses a quirked requester ID or might sometimes
> use the proper requester ID.
> 
> In the example I present, we're taking [1], which could be either case
> above, and converting it into the visibility case in order to force the
> IOMMU to handle the devices within a single address space.  Thanks,
> 
> Alex
> 

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Robin Murphy 5 years ago
On 28/03/2019 10:38, Auger Eric wrote:
> Hi Alex,
> 
> [+ Robin]
> 
> On 3/27/19 5:37 PM, Alex Williamson wrote:
>> On Wed, 27 Mar 2019 14:25:00 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>>> distinguish by devfn & bus between devices in a conventional PCI
>>>> topology and therefore we cannot assign them separate AddressSpaces.
>>>> By taking this requester ID aliasing into account, QEMU better matches
>>>> the bare metal behavior and restrictions, and enables shared
>>>> AddressSpace configurations that are otherwise not possible with
>>>> guest IOMMU support.
>>>>
>>>> For the latter case, given any example where an IOMMU group on the
>>>> host includes multiple devices:
>>>>
>>>>    $ ls  /sys/kernel/iommu_groups/1/devices/
>>>>    0000:00:01.0  0000:01:00.0  0000:01:00.1
>>>
>>> [1]
>>>
>>>>
>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>>> that we can only assign one of the endpoints to the guest because a
>>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>>> only supports IOMMU group level granularity at the container level,
>>>> preventing this second endpoint from being assigned:
>>>>
>>>> qemu-system-x86_64 -machine q35... \
>>>>    -device intel-iommu,intremap=on \
>>>>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>>> 0000:01:00.1: group 1 used in multiple address spaces
>>>>
>>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>>> provides the downstream devices with the same AddressSpace, ex:
>>>>
>>>> qemu-system-x86_64 -machine q35... \
>>>>    -device intel-iommu,intremap=on \
>>>>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>
>>>> While the utility of this hack may be limited, this AddressSpace
>>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>
>>> The patch looks sane to me even as a bug fix since otherwise the DMA
>>> address spaces used under misc kinds of PCI bridges can be wrong, so:
>>
>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
>> I'd be cautious about this if so.  Eric Auger noted that he's seen an
>> SMMU VM hit a guest kernel bug-on, which needs further
>> investigation.  It's not clear if it's just an untested or
>> unimplemented scenario for SMMU to see a conventional PCI bus or if
>> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
>> only VT-d to a very limited degree, thus RFC.
> 
> So I have tracked this further and here is what I can see.
> 
> On the guest side, the 2 assigned devices that I have put downstream of
> the pcie-to-pci bridge each get an iommu_fwspec handle with 2 ids, the
> first one corresponding to the requester ID of the device itself and the
> second one corresponding to the RID with the same bus number and devfn=0:
> 
> dev0 =  0000:02:01.0
>          0000:02:00.0
> 
> dev1 = 0000:02:01.1
>         0000:02:00.0
> 
> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
> Each time it iterates over the associated ids and we call add_device
> twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
> recognizes a context is already alive for 0000:02:00.0 and triggers a
> BUG_ON().

Hmm, aliasing bridges are supposed to be handled as of commit 
563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") - 
what's changed since then?

Robin.
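
For reference, the gist of that commit is that
arm_smmu_install_ste_for_dev() skips Stream IDs already seen earlier in
the same device's fwspec, approximately:

    static int arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec)
    {
        int i, j;
        struct arm_smmu_master_data *master = fwspec->iommu_priv;
        struct arm_smmu_device *smmu = master->smmu;

        for (i = 0; i < fwspec->num_ids; ++i) {
            u32 sid = fwspec->ids[i];
            __le64 *step = arm_smmu_get_step_for_sid(smmu, sid);

            /* Bridged PCI devices may end up with duplicated IDs */
            for (j = 0; j < i; j++)
                if (fwspec->ids[j] == sid)
                    break;
            if (j < i)
                continue;

            arm_smmu_write_strtab_ent(smmu, sid, step, &master->ste);
        }

        return 0;
    }

As Eric explains below, this only deduplicates within a single device's
fwspec; it does not cover two different devices whose fwspecs share an
ID.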

> As for the origin of the 2 ids created for each device:
> iort_iommu_configure is called on each downstream device, which calls
> pci_for_each_dma_alias().  We enter the
> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
> iort_pci_iommu_init is called with bus number 2 and devfn=0.
> 
> Thanks
> 
> Eric
>>   
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>
>>> Though I have a question that confused me even before: Alex, do you
>>> know why all the context entries of the devices in the IOMMU root table
>>> will be programmed even if the devices are under a pcie-to-pci bridge?
>>> I'm giving an example with the above [1] to be clear: in that case IIUC
>>> we'll program context entries for all three devices (00:01.0,
>>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
>>> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
>>> bare metal and then why do we bother to program the context entry of
>>> 01:00.1?  It seems never used.
>>>
>>> (It should be needed for current QEMU to work with pcie-to-pci bridges
>>>   without this patch, but I feel like I don't know the real answer
>>>   behind it)
>>
>> We actually have two different scenarios that could be represented by
>> [1], the group can be formed by lack of isolation or by lack of
>> visibility.  In the group above, it's the former, isolation.  The PCIe
>> root port does not support ACS, so while the IOMMU has visibility of
>> the individual devices, peer-to-peer between devices may also be
>> possible.  Native, trusted, in-kernel drivers for these devices could
>> still make use of separate IOMMU domains per device, but in order to
>> expose the devices to a userspace driver we need to consider them a
>> non-isolated set to prevent side-channel attacks between devices.  We
>> therefore consider them as a group within the IOMMU API and it's
>> required that each context entry maps to the same domain as the IOMMU
>> will see transactions for each requester ID.
>>
>> If we had the visibility case, such as if [1] represented a PCIe-to-PCI
>> bridge subgroup, then the IOMMU really does only see the bridge
>> requester ID and there may not be a reason to populate the context
>> entries for the downstream aliased devices.  Perhaps the IOMMU might
>> still choose to do so, particularly if the bridge is actually a PCI-X
>> bridge as PCI-X does incorporate a requester ID, but also has strange
>> rules about the bridge being able to claim ownership of the
>> transaction.  So it might be paranoia or simplification that causes all
>> the context entries to be programmed, or for alias quirks, uncertainty
>> if a device exclusively uses a quirked requester ID or might sometimes
>> use the proper requester ID.
>>
>> In the example I present, we're taking [1], which could be either case
>> above, and converting it into the visibility case in order to force the
>> IOMMU to handle the devices within a single address space.  Thanks,
>>
>> Alex
>>

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Auger Eric 5 years ago
Hi Robin,

On 3/28/19 11:56 AM, Robin Murphy wrote:
> On 28/03/2019 10:38, Auger Eric wrote:
>> Hi Alex,
>>
>> [+ Robin]
>>
>> On 3/27/19 5:37 PM, Alex Williamson wrote:
>>> On Wed, 27 Mar 2019 14:25:00 +0800
>>> Peter Xu <peterx@redhat.com> wrote:
>>>
>>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>>>> distinguish by devfn & bus between devices in a conventional PCI
>>>>> topology and therefore we cannot assign them separate AddressSpaces.
>>>>> By taking this requester ID aliasing into account, QEMU better matches
>>>>> the bare metal behavior and restrictions, and enables shared
>>>>> AddressSpace configurations that are otherwise not possible with
>>>>> guest IOMMU support.
>>>>>
>>>>> For the latter case, given any example where an IOMMU group on the
>>>>> host includes multiple devices:
>>>>>
>>>>>    $ ls  /sys/kernel/iommu_groups/1/devices/
>>>>>    0000:00:01.0  0000:01:00.0  0000:01:00.1
>>>>
>>>> [1]
>>>>
>>>>>
>>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>>>> that we can only assign one of the endpoints to the guest because a
>>>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>>>> only supports IOMMU group level granularity at the container level,
>>>>> preventing this second endpoint from being assigned:
>>>>>
>>>>> qemu-system-x86_64 -machine q35... \
>>>>>    -device intel-iommu,intremap=on \
>>>>>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>>
>>>>> qemu-system-x86_64: -device
>>>>> vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>>>> 0000:01:00.1: group 1 used in multiple address spaces
>>>>>
>>>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>>>> provides the downstream devices with the same AddressSpace, ex:
>>>>>
>>>>> qemu-system-x86_64 -machine q35... \
>>>>>    -device intel-iommu,intremap=on \
>>>>>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>>
>>>>> While the utility of this hack may be limited, this AddressSpace
>>>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>
>>>> The patch looks sane to me even as a bug fix since otherwise the DMA
>>>> address spaces used under misc kinds of PCI bridges can be wrong, so:
>>>
>>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
>>> I'd be cautious about this if so.  Eric Auger noted that he's seen an
>>> SMMU VM hit a guest kernel bug-on, which needs further
>>> investigation.  It's not clear if it's just an untested or
>>> unimplemented scenario for SMMU to see a conventional PCI bus or if
>>> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
>>> only VT-d to a very limited degree, thus RFC.
>>
>> So I have tracked this further and here is what I can see.
>>
>> On the guest side, the 2 assigned devices that I have put downstream of
>> the pcie-to-pci bridge each get an iommu_fwspec handle with 2 ids, the
>> first one corresponding to the requester ID of the device itself and the
>> second one corresponding to the RID with the same bus number and devfn=0:
>>
>> dev0 =  0000:02:01.0
>>          0000:02:00.0
>>
>> dev1 = 0000:02:01.1
>>         0000:02:00.0
>>
>> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
>> Each time it iterates over the associated ids and we call add_device
>> twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
>> recognizes a context is already alive for 0000:02:00.0 and triggers a
>> BUG_ON().
> 
> Hmm, aliasing bridges are supposed to be handled as of commit
> 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") -
> what's changed since then?

In our case, the problem is not that we have the same ids[] for a given
device, but that we have 2 functions attached to the pcie-to-pci bridge
and their fwspecs have an ids[] entry in common, the one with the
subordinate bus and devfn=0.

-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1

[    2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    2.512095] arm_smmu_install_ste_for_dev sid[0]=520
[    2.513524] arm_smmu_write_strtab_ent sid=520 val=1
[    2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
[    2.516266] arm_smmu_install_ste_for_dev sid[1]=512
[    2.517609] arm_smmu_write_strtab_ent sid=512 val=1
[    2.518958] arm_smmu_write_strtab_ent sid=512 ABORT

[    3.301316] ixgbe 0000:02:01.1: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
[    3.302456] arm_smmu_install_ste_for_dev sid[0]=521
[    3.303420] arm_smmu_write_strtab_ent sid=521 val=1
[    3.304408] arm_smmu_write_strtab_ent sid=521 ABORT
[    3.305442] arm_smmu_install_ste_for_dev sid[1]=512
[    3.306812] arm_smmu_write_strtab_ent sid=512 val=3758686219
[    3.308390] arm_smmu_write_strtab_ent sid=512 S1 | S2 ste_live


Thanks

Eric
> 
> Robin.
> 
>> As for the origin of the 2 ids created for each device:
>> iort_iommu_configure is called on each downstream device, which calls
>> pci_for_each_dma_alias().  We enter the
>> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
>> iort_pci_iommu_init is called with bus number 2 and devfn=0.
>>
>> Thanks
>>
>> Eric
>>>  
>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>
>>>>> Though I have a question that confused me even before: Alex, do you
>>>>> know why all the context entries of the devices in the IOMMU root table
>>>>> will be programmed even if the devices are under a pcie-to-pci bridge?
>>>>> I'm giving an example with the above [1] to be clear: in that case IIUC
>>>>> we'll program context entries for all three devices (00:01.0,
>>>>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
>>>>> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
>>>>> bare metal and then why do we bother to program the context entry of
>>>>> 01:00.1?  It seems never used.
>>>>>
>>>>> (It should be needed for current QEMU to work with pcie-to-pci bridges
>>>>>   without this patch, but I feel like I don't know the real answer
>>>>>   behind it)
>>>
>>> We actually have two different scenarios that could be represented by
>>> [1], the group can be formed by lack of isolation or by lack of
>>> visibility.  In the group above, it's the former, isolation.  The PCIe
>>> root port does not support ACS, so while the IOMMU has visibility of
>>> the individual devices, peer-to-peer between devices may also be
>>> possible.  Native, trusted, in-kernel drivers for these devices could
>>> still make use of separate IOMMU domains per device, but in order to
>>> expose the devices to a userspace driver we need to consider them a
>>> non-isolated set to prevent side-channel attacks between devices.  We
>>> therefore consider them as a group within the IOMMU API and it's
>>> required that each context entry maps to the same domain as the IOMMU
>>> will see transactions for each requester ID.
>>>
>>> If we had the visibility case, such as if [1] represented a PCIe-to-PCI
>>> bridge subgroup, then the IOMMU really does only see the bridge
>>> requester ID and there may not be a reason to populate the context
>>> entries for the downstream aliased devices.  Perhaps the IOMMU might
>>> still choose to do so, particularly if the bridge is actually a PCI-X
>>> bridge as PCI-X does incorporate a requester ID, but also has strange
>>> rules about the bridge being able to claim ownership of the
>>> transaction.  So it might be paranoia or simplification that causes all
>>> the context entries to be programmed, or for alias quirks, uncertainty
>>> if a device exclusively uses a quirked requester ID or might sometimes
>>> use the proper requester ID.
>>>
>>> In the example I present, we're taking [1], which could be either case
>>> above, and converting it into the visibility case in order to force the
>>> IOMMU to handle the devices within a single address space.  Thanks,
>>>
>>> Alex
>>>

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Robin Murphy 5 years ago
On 28/03/2019 12:53, Auger Eric wrote:
> Hi Robin,
> 
> On 3/28/19 11:56 AM, Robin Murphy wrote:
>> On 28/03/2019 10:38, Auger Eric wrote:
>>> Hi Alex,
>>>
>>> [+ Robin]
>>>
>>> On 3/27/19 5:37 PM, Alex Williamson wrote:
>>>> On Wed, 27 Mar 2019 14:25:00 +0800
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>>> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>>>>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>>>>> distinguish by devfn & bus between devices in a conventional PCI
>>>>>> topology and therefore we cannot assign them separate AddressSpaces.
>>>>>> By taking this requester ID aliasing into account, QEMU better matches
>>>>>> the bare metal behavior and restrictions, and enables shared
>>>>>> AddressSpace configurations that are otherwise not possible with
>>>>>> guest IOMMU support.
>>>>>>
>>>>>> For the latter case, given any example where an IOMMU group on the
>>>>>> host includes multiple devices:
>>>>>>
>>>>>>     $ ls  /sys/kernel/iommu_groups/1/devices/
>>>>>>     0000:00:01.0  0000:01:00.0  0000:01:00.1
>>>>>
>>>>> [1]
>>>>>
>>>>>>
>>>>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>>>>> that we can only assign one of the endpoints to the guest because a
>>>>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>>>>> only supports IOMMU group level granularity at the container level,
>>>>>> preventing this second endpoint from being assigned:
>>>>>>
>>>>>> qemu-system-x86_64 -machine q35... \
>>>>>>     -device intel-iommu,intremap=on \
>>>>>>     -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>>>>     -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>>>>     -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>>>>
>>>>>> qemu-system-x86_64: -device
>>>>>> vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>>>>> 0000:01:00.1: group 1 used in multiple address spaces
>>>>>>
>>>>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>>>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>>>>> provides the downstream devices with the same AddressSpace, ex:
>>>>>>
>>>>>> qemu-system-x86_64 -machine q35... \
>>>>>>     -device intel-iommu,intremap=on \
>>>>>>     -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>>>>     -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>>>>     -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>>>>
>>>>>> While the utility of this hack may be limited, this AddressSpace
>>>>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>>>>
>>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>>
>>>>> The patch looks sane to me even as a bug fix since otherwise the DMA
>>>>> address spaces used under misc kinds of PCI bridges can be wrong, so:
>>>>
>>>> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
>>>> I'd be cautious about this if so.  Eric Auger noted that he's seen an
>>>> SMMU VM hit a guest kernel bug-on, which needs further
>>>> investigation.  It's not clear if it's just an untested or
>>>> unimplemented scenario for SMMU to see a conventional PCI bus or if
>>>> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
>>>> only VT-d to a very limited degree, thus RFC.
>>>
>>> So I have tracked this further and here is what I can see.
>>>
>>> On the guest side, the 2 assigned devices that I have put downstream of
>>> the pcie-to-pci bridge each get an iommu_fwspec handle with 2 ids, the
>>> first one corresponding to the requester ID of the device itself and the
>>> second one corresponding to the RID with the same bus number and devfn=0:
>>>
>>> dev0 =  0000:02:01.0
>>>           0000:02:00.0
>>>
>>> dev1 = 0000:02:01.1
>>>          0000:02:00.0
>>>
>>> Then iommu_probe_device is called for 0000:02:01.0 and 0000:02:01.1.
>>> Each time it iterates over the associated ids and we call add_device
>>> twice for 0000:02:00.0. The second time, the arm-smmu-v3 driver
>>> recognizes a context is already alive for 0000:02:00.0 and triggers a
>>> BUG_ON().
>>
>> Hmm, aliasing bridges are supposed to be handled as of commit
>> 563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs") -
>> what's changed since then?
> 
> In our case, the problem is not that we have the same ids[] for a given
> device, but that we have 2 functions attached to the pcie-to-pci bridge
> and their fwspecs have an ids[] entry in common, the one with the
> subordinate bus and devfn=0.

Oh, right, the recollection of having fixed something like this before 
distracted me from that important detail, sorry :)

Yeah, it's a shortcoming of the driver - in general we don't support 
arbitrary devices sharing Stream IDs (as arm_smmu_device_group() says), 
but since we hadn't yet seen (nor expected) a non-trivial legacy PCI 
topology with SMMUv3, that case has never really been tested properly 
either.

Robin.

> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> 
> [    2.509381] ixgbe 0000:02:01.0: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
> [    2.512095] arm_smmu_install_ste_for_dev sid[0]=520
> [    2.513524] arm_smmu_write_strtab_ent sid=520 val=1
> [    2.514872] arm_smmu_write_strtab_ent sid=520 ABORT
> [    2.516266] arm_smmu_install_ste_for_dev sid[1]=512
> [    2.517609] arm_smmu_write_strtab_ent sid=512 val=1
> [    2.518958] arm_smmu_write_strtab_ent sid=512 ABORT
> 
> [    3.301316] ixgbe 0000:02:01.1: arm_smmu_attach_dev ARM_SMMU_DOMAIN_S1
> [    3.302456] arm_smmu_install_ste_for_dev sid[0]=521
> [    3.303420] arm_smmu_write_strtab_ent sid=521 val=1
> [    3.304408] arm_smmu_write_strtab_ent sid=521 ABORT
> [    3.305442] arm_smmu_install_ste_for_dev sid[1]=512
> [    3.306812] arm_smmu_write_strtab_ent sid=512 val=3758686219
> [    3.308390] arm_smmu_write_strtab_ent sid=512 S1 | S2 ste_live
> 
> 
> Thanks
> 
> Eric
>>
>> Robin.
>>
>>> As for the origin of the 2 ids created for each device:
>>> iort_iommu_configure is called on each downstream device, which calls
>>> pci_for_each_dma_alias().  We enter the
>>> pci_is_pcie(tmp)/PCI_EXP_TYPE_PCI_BRIDGE code path and
>>> iort_pci_iommu_init is called with bus number 2 and devfn=0.
>>>
>>> Thanks
>>>
>>> Eric
>>>>   
>>>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>>
>>>>> Though I have a question that confused me even before: Alex, do you
>>>>> know why all the context entries of the devices in the IOMMU root table
>>>>> will be programmed even if the devices are under a pcie-to-pci bridge?
>>>>> I'm giving an example with the above [1] to be clear: in that case IIUC
>>>>> we'll program context entries for all three devices (00:01.0,
>>>>> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
>>>>> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
>>>>> bare metal and then why do we bother to program the context entry of
>>>>> 01:00.1?  It seems never used.
>>>>>
>>>>> (It should be needed for current QEMU to work with pcie-to-pci bridges
>>>>>    without this patch, but I feel like I don't know the real answer
>>>>>    behind it)
>>>>
>>>> We actually have two different scenarios that could be represented by
>>>> [1], the group can be formed by lack of isolation or by lack of
>>>> visibility.  In the group above, it's the former, isolation.  The PCIe
>>>> root port does not support ACS, so while the IOMMU has visibility of
>>>> the individual devices, peer-to-peer between devices may also be
>>>> possible.  Native, trusted, in-kernel drivers for these devices could
>>>> still make use of separate IOMMU domains per device, but in order to
>>>> expose the devices to a userspace driver we need to consider them a
>>>> non-isolated set to prevent side-channel attacks between devices.  We
>>>> therefore consider them as a group within the IOMMU API and it's
>>>> required that each context entry maps to the same domain as the IOMMU
>>>> will see transactions for each requester ID.
>>>>
>>>> If we had the visibility case, such as if [1] represented a PCIe-to-PCI
>>>> bridge subgroup, then the IOMMU really does only see the bridge
>>>> requester ID and there may not be a reason to populate the context
>>>> entries for the downstream aliased devices.  Perhaps the IOMMU might
>>>> still choose to do so, particularly if the bridge is actually a PCI-X
>>>> bridge as PCI-X does incorporate a requester ID, but also has strange
>>>> rules about the bridge being able to claim ownership of the
>>>> transaction.  So it might be paranoia or simplification that causes all
>>>> the context entries to be programmed, or for alias quirks, uncertainty
>>>> if a device exclusively uses a quirked requester ID or might sometimes
>>>> use the proper requester ID.
>>>>
>>>> In the example I present, we're taking [1], which could be either case
>>>> above, and converting it into the visibility case in order to force the
>>>> IOMMU to handle the devices within a single address space.  Thanks,
>>>>
>>>> Alex
>>>>

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Posted by Peter Xu 5 years ago
On Wed, Mar 27, 2019 at 10:37:09AM -0600, Alex Williamson wrote:
> On Wed, 27 Mar 2019 14:25:00 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > > distinguish by devfn & bus between devices in a conventional PCI
> > > topology and therefore we cannot assign them separate AddressSpaces.
> > > By taking this requester ID aliasing into account, QEMU better matches
> > > the bare metal behavior and restrictions, and enables shared
> > > AddressSpace configurations that are otherwise not possible with
> > > guest IOMMU support.
> > > 
> > > For the latter case, given any example where an IOMMU group on the
> > > host includes multiple devices:
> > > 
> > >   $ ls  /sys/kernel/iommu_groups/1/devices/
> > >   0000:00:01.0  0000:01:00.0  0000:01:00.1  
> > 
> > [1]
> > 
> > > 
> > > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > > that we can only assign one of the endpoints to the guest because a
> > > second endpoint will attempt to use a different AddressSpace.  VFIO
> > > only supports IOMMU group level granularity at the container level,
> > > preventing this second endpoint from being assigned:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > > 
> > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > > 0000:01:00.1: group 1 used in multiple address spaces
> > > 
> > > However, when QEMU incorporates proper aliasing, we can make use of a
> > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > > provides the downstream devices with the same AddressSpace, ex:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > > 
> > > While the utility of this hack may be limited, this AddressSpace
> > > aliasing is the correct behavior for QEMU to emulate bare metal.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> > 
> > The patch looks sane to me even as a bug fix, since otherwise the DMA
> > address spaces used under various kinds of PCI bridges can be wrong, so:
> 
> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
> I'd be cautious about that if so.  Eric Auger noted that he's seen an
> SMMU VM hit a guest kernel BUG_ON, which needs further investigation.
> It's not clear whether it's just an untested or unimplemented scenario
> for SMMU to see a conventional PCI bus, or whether there's something
> wrong in QEMU.  I also haven't tested AMD IOMMU, and have tested VT-d
> only to a very limited degree, thus the RFC.

Sorry to be unclear.  I didn't mean to target this for 4.0, and I
completely agree that it should wait until after the release, since it
is still a fairly invasive change to QEMU's PCI subsystem, not to
mention that the system mostly works well even without this patch.

(except for things like assignment of multi-function devices with a
 vIOMMU, but that is rare, after all)
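
To recap why I see it as a bug fix: without the aliasing, each function
behind a conventional bridge gets its own AddressSpace even though the
IOMMU can only ever see the bridge's requester ID for their DMA.  A
minimal sketch of the property the patch restores (illustrative only;
check_shared_as() and its arguments are hypothetical, not code from the
patch):

    /*
     * dev0/dev1 stand for the PCIDevice pointers of 01:00.0 and
     * 01:00.1 behind the conventional bridge.  With the aliasing from
     * this patch, both must resolve to the same AddressSpace.
     */
    static void check_shared_as(PCIDevice *dev0, PCIDevice *dev1)
    {
        AddressSpace *as0 = pci_device_iommu_address_space(dev0);
        AddressSpace *as1 = pci_device_iommu_address_space(dev1);
        assert(as0 == as1);  /* fails without the aliasing in this patch */
    }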

>  
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > 
> > Though I have a question that confused me even before: Alex, do you
> > know why all the context entries of the devices in the IOMMU root
> > table are programmed even if the devices are under a pcie-to-pci
> > bridge?  To give an example with [1] above: in that case IIUC we'll
> > program context entries for all three devices (00:01.0, 01:00.0,
> > 01:00.1), but they'll point to the same IOMMU table.  DMAs of devices
> > 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on bare
> > metal, so why do we bother to program the context entry of 01:00.1?
> > It seems never to be used.
> > 
> > (It is needed for current QEMU to work with pcie-to-pci bridges
> >  without this patch, but I feel like I don't know the real answer
> >  behind it)
> 
> We actually have two different scenarios that could be represented by
> [1], the group can be formed by lack of isolation or by lack of
> visibility.  In the group above, it's the former, isolation.  The PCIe
> root port does not support ACS, so while the IOMMU has visibility of
> the individual devices, peer-to-peer between devices may also be
> possible.  Native, trusted, in-kernel drivers for these devices could
> still make use of separate IOMMU domains per device, but in order to
> expose the devices to a userspace driver we need to consider them a
> non-isolated set to prevent side-channel attacks between devices.  We
> therefore consider them as a group within the IOMMU API, and it's
> required that each context entry map to the same domain, because the
> IOMMU will see transactions from each requester ID.
> 
> If we had the visibility case, such as if [1] represented a PCIe-to-PCI
> bridge subgroup, then the IOMMU really does only see the bridge
> requester ID and there may not be a reason to populate the context
> entries for the downstream aliased devices.  Perhaps the IOMMU might
> still choose to do so, particularly if the bridge is actually a PCI-X
> bridge as PCI-X does incorporate a requester ID, but also has strange
> rules about the bridge being able to claim ownership of the
> transaction.  So it might be paranoia or simplification that causes all
> the context entries to be programmed, or, in the case of alias quirks,
> uncertainty whether a device exclusively uses a quirked requester ID or
> might sometimes use the proper one.
> 
> In the example I present, we're taking [1], which could be either case
> above, and converting it into the visibility case in order to force the
> IOMMU to handle the devices within a single address space.  Thanks,

The answers are detailed and clear (as usual :).  My thanks!
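
To make sure I picture the isolation case right, here is a conceptual
sketch of "each context entry maps to the same domain" (made-up
structures for illustration, not real VT-d driver code):

    /*
     * Any requester ID in the non-isolated group may appear on the
     * bus, so every one of them gets a context entry pointing at the
     * same domain, i.e. the same set of page tables.
     */
    typedef struct ContextEntry { uint64_t domain_pgd; } ContextEntry;
    static ContextEntry root_table[256 * 256];  /* indexed by bus:devfn */

    static void attach_group(const uint16_t *rids, int n, uint64_t pgd)
    {
        for (int i = 0; i < n; i++) {
            root_table[rids[i]].domain_pgd = pgd;  /* same tables for all */
        }
    }

For the example group [1] that would be requester IDs 0x0008, 0x0100
and 0x0101, i.e. 00:01.0, 01:00.0 and 01:00.1.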

-- 
Peter Xu