> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, December 9, 2023 9:47 AM
>
> What is in a Nested domain:
>  Intel: A single IO page table referred to by a PASID entry
>         Each vDomain-ID,PASID allocates a unique nesting domain
>  AMD: A GCR3 table pointer
>       Nesting domains are created for every unique GCR3 pointer.
>       vDomain-ID can possibly refer to multiple Nesting domains :(
>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.

this AMD/ARM difference is not very clear to me.

How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
lead to a cache tag conflict when the same PASID entry in multiple GCR3
tables points to different I/O page tables?
On 2023/12/11 10:29, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Saturday, December 9, 2023 9:47 AM
>>
>> What is in a Nested domain:
>>  Intel: A single IO page table referred to by a PASID entry
>>         Each vDomain-ID,PASID allocates a unique nesting domain
>>  AMD: A GCR3 table pointer
>>       Nesting domains are created for every unique GCR3 pointer.
>>       vDomain-ID can possibly refer to multiple Nesting domains :(
>>  ARM: A CD table pointer
>>       Nesting domains are created for every unique CD table top pointer.
>
> this AMD/ARM difference is not very clear to me.
>
> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> lead to a cache tag conflict when the same PASID entry in multiple GCR3
> tables points to different I/O page tables?

Perhaps because there is only one DomainID in the DTE table indexed by
BDF? Actually, the vDomainID will not be used to tag the cache; the host
DomainID would be used instead. @Jason?

--
Regards,
Yi Liu
On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
> On 2023/12/11 10:29, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, December 9, 2023 9:47 AM
> > >
> > > What is in a Nested domain:
> > >  Intel: A single IO page table referred to by a PASID entry
> > >         Each vDomain-ID,PASID allocates a unique nesting domain
> > >  AMD: A GCR3 table pointer
> > >       Nesting domains are created for every unique GCR3 pointer.
> > >       vDomain-ID can possibly refer to multiple Nesting domains :(
> > >  ARM: A CD table pointer
> > >       Nesting domains are created for every unique CD table top pointer.
> >
> > this AMD/ARM difference is not very clear to me.
> >
> > How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> > lead to a cache tag conflict when the same PASID entry in multiple GCR3
> > tables points to different I/O page tables?
>
> Perhaps because there is only one DomainID in the DTE table indexed by
> BDF? Actually, the vDomainID will not be used to tag the cache; the host
> DomainID would be used instead. @Jason?

The DomainID comes from the DTE table which is indexed by the RID, and
the DTE entry points to the GCR3 table. So the VM certainly can set up
a DTE table with multiple entries having the same vDomainID but
pointing to different GCR3s. So the VMM has to do *something* with
this.

Most likely this is not a useful thing to do. However, what should the
VMM do when it sees this? Block a random DTE or push the duplication
down to real HW would be my options. I'd probably try to do the latter
just on the basis of better emulation.

Jason
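As an illustration of the duplication Jason describes, here is a small model (plain Python with invented RIDs and addresses, not kernel code) of a guest DTE table whose entries reuse one vDomainID while pointing at different GCR3 tables:

```python
# Conceptual model: each guest DTE, indexed by RID, carries a vDomainID
# and a GCR3 table pointer. Nothing stops the guest from reusing one
# vDomainID across entries with different GCR3 pointers.

def nesting_domains_per_vdomain(dte_table):
    """Group distinct GCR3 pointers under each vDomainID.

    dte_table: dict mapping RID -> (vdomain_id, gcr3_ptr)
    Returns: dict mapping vdomain_id -> set of distinct GCR3 pointers,
    i.e. the nesting domains the VMM must create for that vDomainID.
    """
    domains = {}
    for rid, (vdomain_id, gcr3_ptr) in dte_table.items():
        domains.setdefault(vdomain_id, set()).add(gcr3_ptr)
    return domains

# A guest sets up two RIDs with the same vDomainID but different GCR3s:
guest_dtes = {
    0x10: (1, 0xA000),  # RID 0x10 -> vDomainID 1, GCR3 table at 0xA000
    0x18: (1, 0xB000),  # RID 0x18 -> same vDomainID, different GCR3 table
}
conflict = {v: g for v, g in nesting_domains_per_vdomain(guest_dtes).items()
            if len(g) > 1}
# vDomainID 1 now needs two nesting domains -- the ":(" case above.
```

The VMM's choice is then exactly the one Jason names: reject one of the DTEs, or mirror the duplication into real hardware state.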
On 12/11/2023 8:05 PM, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
>> On 2023/12/11 10:29, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Saturday, December 9, 2023 9:47 AM
>>>>
>>>> What is in a Nested domain:
>>>> Intel: A single IO page table referred to by a PASID entry
>>>> Each vDomain-ID,PASID allocates a unique nesting domain
>>>> AMD: A GCR3 table pointer
>>>> Nesting domains are created for every unique GCR3 pointer.
>>>> vDomain-ID can possibly refer to multiple Nesting domains :(
>>>> ARM: A CD table pointer
>>>> Nesting domains are created for every unique CD table top pointer.
>>>
>>> this AMD/ARM difference is not very clear to me.
>>>
>>> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
>>> lead to a cache tag conflict when the same PASID entry in multiple GCR3 tables
>>> points to different I/O page tables?
>>
>> Perhaps due to only one DomainID in the DTE table indexed by BDF? Actually,
>> the vDomainID will not be used to tag cache, the host DomainId would be
>> used instead. @Jason?
>
> The DomainID comes from the DTE table which is indexed by the RID, and
> the DTE entry points to the GCR3 table. So the VM certainly can set up
> a DTE table with multiple entries having the same vDomainID but
> pointing to different GCR3's. So the VMM has to do *something* with
> this.
>
> Most likely this is not a useful thing to do. However what should the
> VMM do when it sees this? Block a random DTE or push the duplication
> down to real HW would be my options. I'd probably try to do the latter
> just on the basis of better emulation.
>
> Jason
For AMD, the hardware uses host DomainID (hDomainId) and PASID to tag
the IOMMU TLB.
The VM can set up the vDomainID independently of the device (RID) and
hDomainID. The vDomainId->hDomainId mapping would be managed by the host
IOMMU driver (since it is also needed by the HW when enabling the
HW-vIOMMU support, a.k.a. the virtual function).
Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
One issue with this arises with nested translation, where we could
end up with multiple devices (RIDs) sharing the same PASID and the
same hDomainID.
For example:
- Host view
Device1 (RID 1) w/ hDomainId 1
Device2 (RID 2) w/ hDomainId 1
- Guest view
Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
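To make the tag conflict concrete, here is a toy IOTLB model (illustrative Python; the addresses and the cache structure are invented for the example) where entries are keyed only by (hDomainId, PASID, IOVA), as on the hardware described above:

```python
# Toy IOTLB model: entries are tagged only by (hDomainId, PASID). If two
# RIDs share hDomainId 1 and PASID 0 but point at different I/O page
# tables, the second device hits the first device's cached translation.

class IOTLB:
    def __init__(self):
        self.cache = {}  # (hdomain_id, pasid, iova) -> phys

    def translate(self, hdomain_id, pasid, iova, page_table):
        key = (hdomain_id, pasid, iova)
        if key not in self.cache:          # miss: walk this device's table
            self.cache[key] = page_table[iova]
        return self.cache[key]             # hit: tag cannot tell devices apart

tlb = IOTLB()
dev1_pt = {0x1000: 0xAAAA}   # Device1's PASID-0 page table
dev2_pt = {0x1000: 0xBBBB}   # Device2's table maps the same IOVA elsewhere

a = tlb.translate(1, 0, 0x1000, dev1_pt)   # Device1 fills the entry
b = tlb.translate(1, 0, 0x1000, dev2_pt)   # Device2 wrongly hits Device1's entry
# a == b == 0xAAAA: Device2 received Device1's translation -- the conflict.
```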
We should be able to work around this by changing the way we assign the
hDomainId to be per-device for VFIO pass-through devices, even while they
share the same v1 (stage-2) page table. This would look like:
- Host view
Device1 (RID 1) w/ hDomainId 1
Device2 (RID 2) w/ hDomainId 2
- Guest view
Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
This should avoid the IOMMU TLB conflict. However, the invalidation
would need to be done for both DomainId 1 and 2 when updating the v1
(stage-2) page table.
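A rough sketch of this workaround (illustrative Python, not driver code), showing the invalidation fan-out when the shared stage-2 table changes:

```python
# Sketch: give each pass-through device its own hDomainId even when they
# share one stage-2 page table, and remember the fan-out so a stage-2
# update invalidates every associated domain ID.

class Stage2Table:
    def __init__(self):
        self.hdomain_ids = []   # all hDomainIds sharing this table
        self.invalidated = []   # invalidations issued (for illustration)

    def attach_device(self, hdomain_id):
        self.hdomain_ids.append(hdomain_id)

    def update_mapping(self):
        # After changing the shared stage-2 table, flush under every
        # hDomainId that references it -- the extra cost noted above.
        for hdomain_id in self.hdomain_ids:
            self.invalidated.append(hdomain_id)

s2 = Stage2Table()
s2.attach_device(1)   # Device1 -> hDomainId 1
s2.attach_device(2)   # Device2 -> hDomainId 2, same stage-2 table
s2.update_mapping()
# s2.invalidated == [1, 2]: one table edit, two invalidations.
```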
Thanks,
Suravee
On Mon, Dec 11, 2023 at 10:34:09PM +0700, Suthikulpanit, Suravee wrote:
> Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
> One issue with this arises with nested translation, where we could end up
> with multiple devices (RIDs) sharing the same PASID and the same hDomainID.

Which means you also create multiple GCR3 tables, since those are (soon)
per-device, and we end up with the situation I described for a functionally
legitimate reason :( It is just wasting memory by duplicating GCR3 tables.

> For example:
>
> - Host view
>   Device1 (RID 1) w/ hDomainId 1
>   Device2 (RID 2) w/ hDomainId 1

So.. groups are another ugly mess that we may have to do something more
robust about. The group infrastructure assumes that all devices in the
group have the same translation. This is not how the VM communicates:
each member of the group gets its own DTE, and there are legitimate cases
where the DTEs will be different (even if just temporarily).

How to mesh this is not yet solved (most likely we need to allow group
members to have temporarily different translations). But in the long run
the group should definitely not be providing the cache tag; the driver
has to be smarter than this. I think we talked about this before..

For the AMD driver, the v1 page table should store the domain ID in the
iommu_domain, and that value should be used everywhere.

For modes with a GCR3 table, the best you can do is to de-duplicate the
GCR3 tables and assign identical GCR3 tables identical domain IDs. I.e.
all devices in a group will eventually share GCR3 tables, so they can
converge on the same domain ID.

> - Guest view
>   Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>   Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> We should be able to work around this by changing the way we assign the
> hDomainId to be per-device for VFIO pass-through devices, even while they
> share the same v1 (stage-2) page table. This would look like:

As I said, this doesn't quite work, since the VM could do other things.
The kernel must be aware of the vDomainID and must select an appropriate
hDomainID with that knowledge in mind; otherwise multi-device groups in
guests are fully broken.

> - Guest view
>   Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>   Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> This should avoid the IOMMU TLB conflict. However, the invalidation would
> need to be done for both DomainIds 1 and 2 when updating the v1 (stage-2)
> page table.

Which is the key problem: if the VM thinks it has only one vDomainID, the
VMM can't split that into two hDomainIDs and expect the vIOMMU
acceleration to work - so we shouldn't try to make it work in SW either,
IMHO.

Jason
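The de-duplication idea could be modeled roughly as follows (illustrative Python; the allocator and its names are invented for the example): key the hDomainId allocation on the guest GCR3 table pointer, so devices whose DTEs share a GCR3 table converge on one host domain ID.

```python
# Sketch: allocate hDomainIds keyed on the guest GCR3 table pointer, so
# identical GCR3 tables get identical domain IDs instead of one per device.

import itertools

class DomainIdAllocator:
    def __init__(self):
        self._next = itertools.count(1)
        self._by_gcr3 = {}   # gcr3_ptr -> hDomainId

    def hdomain_for(self, gcr3_ptr):
        # De-duplicate: a GCR3 table seen before reuses its domain ID.
        if gcr3_ptr not in self._by_gcr3:
            self._by_gcr3[gcr3_ptr] = next(self._next)
        return self._by_gcr3[gcr3_ptr]

alloc = DomainIdAllocator()
d1 = alloc.hdomain_for(0xA000)   # first device in the group
d2 = alloc.hdomain_for(0xA000)   # second device, same GCR3 -> same hDomainId
d3 = alloc.hdomain_for(0xB000)   # different GCR3 -> new hDomainId
# d1 == d2, so a shared GCR3 table yields one cache tag, not two.
```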