RE: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)

RE: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Posted by Tian, Kevin 2 years ago
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, December 9, 2023 9:47 AM
> 
> What is in a Nested domain:
>  Intel: A single IO page table referred to by a PASID entry
>         Each vDomain-ID,PASID allocates a unique nesting domain
>  AMD: A GCR3 table pointer
>       Nesting domains are created for every unique GCR3 pointer.
>       vDomain-ID can possibly refer to multiple Nesting domains :(
>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.

this AMD/ARM difference is not very clear to me.

How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
lead to a cache tag conflict when the same PASID entry in multiple GCR3 tables
points to different I/O page tables?
Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Posted by Yi Liu 2 years ago
On 2023/12/11 10:29, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Saturday, December 9, 2023 9:47 AM
>>
>> What is in a Nested domain:
>>   Intel: A single IO page table referred to by a PASID entry
>>          Each vDomain-ID,PASID allocates a unique nesting domain
>>   AMD: A GCR3 table pointer
>>        Nesting domains are created for every unique GCR3 pointer.
>>        vDomain-ID can possibly refer to multiple Nesting domains :(
>>   ARM: A CD table pointer
>>        Nesting domains are created for every unique CD table top pointer.
> 
> this AMD/ARM difference is not very clear to me.
> 
> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> lead to a cache tag conflict when the same PASID entry in multiple GCR3 tables
> points to different I/O page tables?

Perhaps because there is only one DomainID in the DTE table entry indexed
by BDF? Actually, the vDomainID will not be used to tag the cache; the
host DomainId would be used instead. @Jason?

-- 
Regards,
Yi Liu
Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Posted by Jason Gunthorpe 2 years ago
On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
> On 2023/12/11 10:29, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, December 9, 2023 9:47 AM
> > > 
> > > What is in a Nested domain:
> > >   Intel: A single IO page table referred to by a PASID entry
> > >          Each vDomain-ID,PASID allocates a unique nesting domain
> > >   AMD: A GCR3 table pointer
> > >        Nesting domains are created for every unique GCR3 pointer.
> > >        vDomain-ID can possibly refer to multiple Nesting domains :(
> > >   ARM: A CD table pointer
> > >        Nesting domains are created for every unique CD table top pointer.
> > 
> > this AMD/ARM difference is not very clear to me.
> > 
> > How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> > lead to a cache tag conflict when the same PASID entry in multiple GCR3 tables
> > points to different I/O page tables?
> 
> Perhaps because there is only one DomainID in the DTE table entry indexed
> by BDF? Actually, the vDomainID will not be used to tag the cache; the
> host DomainId would be used instead. @Jason?

The DomainID comes from the DTE table, which is indexed by the RID, and
the DTE entry points to the GCR3 table. So the VM certainly can set up
a DTE table with multiple entries having the same vDomainID but
pointing to different GCR3s. So the VMM has to do *something* with
this.

Most likely this is not a useful thing to do. However, what should the
VMM do when it sees this? Blocking a random DTE or pushing the duplication
down to the real HW would be my options. I'd probably try the latter,
just on the basis of better emulation.
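
To make the first option concrete, here is a rough sketch in plain C of
the VMM-side bookkeeping; every type and name below is invented for
illustration, not actual VMM code:

/*
 * Sketch only: a VMM-side check on guest DTE writes. All names and
 * types here are invented for illustration.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VRID 256

struct vdte_shadow {
        uint16_t vdomid;        /* vDomainID the guest wrote */
        uint64_t gcr3;          /* GCR3 table pointer, 0 = unused */
};

static struct vdte_shadow shadow[MAX_VRID];     /* indexed by vRID */

/*
 * Returns false if this vDTE would alias an existing vDomainID to a
 * different GCR3 table ("block a random DTE"). A VMM pushing the
 * duplication down to HW would instead allocate one nesting domain
 * per unique GCR3 pointer and accept the entry.
 */
static bool vdte_write(uint16_t vrid, uint16_t vdomid, uint64_t gcr3)
{
        for (size_t i = 0; i < MAX_VRID; i++) {
                if (i == vrid || !shadow[i].gcr3)
                        continue;
                if (shadow[i].vdomid == vdomid && shadow[i].gcr3 != gcr3)
                        return false;
        }
        shadow[vrid].vdomid = vdomid;
        shadow[vrid].gcr3 = gcr3;
        return true;
}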

Jason
Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Posted by Suthikulpanit, Suravee 2 years ago

On 12/11/2023 8:05 PM, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
>> On 2023/12/11 10:29, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Saturday, December 9, 2023 9:47 AM
>>>>
>>>> What is in a Nested domain:
>>>>    Intel: A single IO page table referred to by a PASID entry
>>>>           Each vDomain-ID,PASID allocates a unique nesting domain
>>>>    AMD: A GCR3 table pointer
>>>>         Nesting domains are created for every unique GCR3 pointer.
>>>>         vDomain-ID can possibly refer to multiple Nesting domains :(
>>>>    ARM: A CD table pointer
>>>>         Nesting domains are created for every unique CD table top pointer.
>>>
>>> this AMD/ARM difference is not very clear to me.
>>>
>>> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
>>> lead to a cache tag conflict when the same PASID entry in multiple GCR3 tables
>>> points to different I/O page tables?
>>
>> Perhaps because there is only one DomainID in the DTE table entry indexed
>> by BDF? Actually, the vDomainID will not be used to tag the cache; the
>> host DomainId would be used instead. @Jason?
> 
> The DomainID comes from the DTE table, which is indexed by the RID, and
> the DTE entry points to the GCR3 table. So the VM certainly can set up
> a DTE table with multiple entries having the same vDomainID but
> pointing to different GCR3s. So the VMM has to do *something* with
> this.
> 
> Most likely this is not a useful thing to do. However, what should the
> VMM do when it sees this? Blocking a random DTE or pushing the duplication
> down to the real HW would be my options. I'd probably try the latter,
> just on the basis of better emulation.
> 
> Jason

For AMD, the hardware uses host DomainID (hDomainId) and PASID to tag 
the IOMMU TLB.

The VM can set up the vDomainID independently of the device (RID) and
hDomainID. The vDomainId->hDomainId mapping would be managed by the host
IOMMU driver (since this is also needed by the HW when enabling the
HW-vIOMMU support, a.k.a. the virtual function).
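
As a very rough sketch of that bookkeeping (names invented; nothing here
is the real driver API, and ID reuse/exhaustion is ignored):

/*
 * Sketch only: vDomainId -> hDomainId translation kept by the host.
 * Names are invented; no ID reuse or exhaustion handling here.
 */
#include <stdint.h>

#define NR_VDOMID 65536

static uint16_t vdom_to_hdom[NR_VDOMID];        /* 0 = not mapped yet */
static uint16_t next_hdomid = 1;

static uint16_t hdomid_for(uint16_t vdomid)
{
        if (!vdom_to_hdom[vdomid])
                vdom_to_hdom[vdomid] = next_hdomid++;   /* allocate on first use */
        return vdom_to_hdom[vdomid];
}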

Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
One issue with this arises with nested translation, where we could
end up with multiple devices (RIDs) sharing the same PASID and the
same hDomainID.

For example:

   - Host view
     Device1 (RID 1) w/ hDomainId 1
     Device2 (RID 2) w/ hDomainId 1
   - Guest view
     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

We should be able to work around this by assigning the hDomainId
per-device for VFIO pass-through devices while still sharing the same
v1 (stage-2) page table. This would look like:

   - Host view
     Device1 (RID 1) w/ hDomainId 1
     Device2 (RID 2) w/ hDomainId 2
   - Guest view
     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

This should avoid the IOMMU TLB conflict. However, the invalidation 
would need to be done for both DomainId 1 and 2 when updating the v1 
(stage-2) page table.
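
A minimal sketch of that dual invalidation, assuming invented names and
a stand-in for the per-DomainId HW flush command:

/*
 * Sketch only: flushing every hDomainId attached to one shared v1
 * (stage-2) page table. Names are invented for illustration.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct v1_pgtable {
        uint16_t hdomids[4];    /* e.g. { 1, 2 } from the example above */
        size_t nr_hdomids;
};

static void flush_domid(uint16_t hdomid)
{
        /* stand-in for the per-DomainId HW invalidation command */
        printf("invalidate DomainId %u\n", (unsigned)hdomid);
}

static void v1_pgtable_invalidate(const struct v1_pgtable *pt)
{
        /* one update to the shared table -> one flush per hDomainId */
        for (size_t i = 0; i < pt->nr_hdomids; i++)
                flush_domid(pt->hdomids[i]);
}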

Thanks,
Suravee
Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Posted by Jason Gunthorpe 2 years ago
On Mon, Dec 11, 2023 at 10:34:09PM +0700, Suthikulpanit, Suravee wrote:

> Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
> One issue with this arises with nested translation, where we could end up
> with multiple devices (RIDs) sharing the same PASID and the same hDomainID.

Which means you also create multiple GCR3 tables since those are
(soon) per-device and we end up with the situation I described for a
functionally legitimate reason :( It is just wasting memory by
duplicating GCR3 tables.
 
> For example:
> 
>   - Host view
>     Device1 (RID 1) w/ hDomainId 1
>     Device2 (RID 2) w/ hDomainId 1

So.. Groups are another ugly mess that we may have to do something
more robust about.

The group infrastructure assumes that all devices in the group have
the same translation. This is not how the VM communicates: each member
of the group gets its own DTE, and there are legitimate cases
where the DTEs will be different (even if just temporarily).

How to mesh this is not yet solved (most likely we need to allow group
members to have temporarily different translation). But in the long
run the group should definitely not be providing the cache tag; the
driver has to be smarter than this.

I think we talked about this before.. For the AMD driver, the v1 page
table should store the domainid in the iommu_domain, and that value
should be used everywhere.

For modes with a GCR3 table the best you can do is to de-duplicate the
GCR3 tables and assign identical GCR3 tables to identical domain ids.
I.e., all devices in a group will eventually share GCR3 tables, so they
can converge on the same domain id.
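
A minimal sketch of that de-duplication, keyed on the GCR3 table top
pointer (names invented; no bounds checking, locking, or teardown):

/*
 * Sketch only: identical GCR3 table pointers converge on one domain
 * id. Names invented for illustration.
 */
#include <stddef.h>
#include <stdint.h>

struct gcr3_dedup {
        uint64_t gcr3;          /* GCR3 table top pointer */
        uint16_t hdomid;        /* shared by every user of this table */
};

static struct gcr3_dedup dedup[64];
static size_t nr_dedup;
static uint16_t next_domid = 1;

static uint16_t domid_for_gcr3(uint64_t gcr3)
{
        for (size_t i = 0; i < nr_dedup; i++)
                if (dedup[i].gcr3 == gcr3)
                        return dedup[i].hdomid; /* same table, same id */
        dedup[nr_dedup].gcr3 = gcr3;
        dedup[nr_dedup].hdomid = next_domid++;
        return dedup[nr_dedup++].hdomid;
}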

>   - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
> 
> We should be able to work around this by assigning the hDomainId per-device
> for VFIO pass-through devices while still sharing the same v1 (stage-2)
> page table. This would look like:

As I said, this doesn't quite work since the VM could do other
things. The kernel must be aware of the vDomainID and must select an
appropriate hDomainID with that knowledge in mind; otherwise,
multi-device groups in guests are fully broken.

>   - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
> 
> This should avoid the IOMMU TLB conflict. However, the invalidation would
> need to be done for both DomainId 1 and 2 when updating the v1 (stage-2)
> page table.

Which is the key problem: if the VM thinks it has only one vDomainID,
the VMM can't split that into two hDomainIDs and expect the vIOMMU
acceleration to work - so we shouldn't try to make it work in SW
either, IMHO.

Jason