RE: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Shameerali Kolothum Thodi 11 months ago
Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, January 11, 2025 3:32 AM
> To: will@kernel.org; robin.murphy@arm.com; jgg@nvidia.com;
> kevin.tian@intel.com; tglx@linutronix.de; maz@kernel.org;
> alex.williamson@redhat.com
> Cc: joro@8bytes.org; shuah@kernel.org; reinette.chatre@intel.com;
> eric.auger@redhat.com; yebin (H) <yebin10@huawei.com>;
> apatel@ventanamicro.com; shivamurthy.shastri@linutronix.de;
> bhelgaas@google.com; anna-maria@linutronix.de; yury.norov@gmail.com;
> nipun.gupta@amd.com; iommu@lists.linux.dev; linux-
> kernel@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> kvm@vger.kernel.org; linux-kselftest@vger.kernel.org;
> patches@lists.linux.dev; jean-philippe@linaro.org; mdf@kernel.org;
> mshavit@google.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; smostafa@google.com;
> ddutile@redhat.com
> Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with
> nested SMMU
> 
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called the
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a. IOVA),
> e.g. 0xFFFF0000, which must be mapped to the physical ITS page:
>   IOVA (0xFFFF0000) ===> PA (0x20200000)
> When a 2-stage translation is enabled, the IOVA is still used to
> program the MSI address, though the mapping is now in two stages:
>   IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
> (IPA stands for Intermediate Physical Address).
> 
> If the device that generates the MSI is attached to an
> IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of
> the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO
> passthrough device), the IOVA is fixed to an MSI window reported by
> the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to
> MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
> 
> So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely
> in charge of the IOMMU translation (1-stage translation) and the IOVA
> for the ITS page is fixed and known by the kernel. However, with a
> virtual machine enabling nested IOMMU translation (2-stage), the guest
> kernel directly controls the stage-1 translation with an
> IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA, e.g. 0x80900000)
> onto its own IOVA space (e.g. 0xEEEE0000). The host kernel then has no
> way to know that guest-level IOVA in order to program the MSI address.
> 
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. The VMM could insert a
>    few RMRs (Reserved Memory Regions) in the guest's IORT. The guest
>    kernel would then fetch these RMR entries from the IORT and create
>    an IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
>    Eventually, the mappings would look like:
>      IOVA (0x8000000) ===> IPA (0x8000000) ===> PA (0x20200000)
>    This requires an IOMMUFD ioctl for the kernel and the VMM to agree
>    on the IPA.
> 2. Forward the guest-level MSI IOVA captured by the VMM to the
>    host-level GIC driver, to program the correct MSI IOVA. Forward the
>    VMM-defined vITS page location (IPA) to the kernel for the stage-2
>    mapping. Eventually:
>      IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
>    This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl (for
>    the IPA).
> 
> It is worth mentioning that when Eric Auger was working on the same
> topic with the VFIO iommu uAPI, he tried approach (2) first and then
> switched to approach (1), as suggested by Jean-Philippe, to reduce
> complexity.
> 
> Approach (1) basically feels like the existing VFIO passthrough case
> that has a 1-stage mapping for the unmanaged domain, only with the MSI
> mapping shifted from stage 1 (guest-has-no-iommu case) to stage 2
> (guest-has-iommu case). So it could reuse the existing
> IOMMU_RESV_SW_MSI piece, sharing the same idea of "the VMM leaving
> everything to the kernel".
> 
> Approach (2) is the ideal solution, yet it requires additional effort
> for the kernel to be aware of the stage-1 gIOVA(s) and the stage-2
> IPAs for the vITS page(s), which demands close cooperation from the
> VMM.
>  * It also brings some complicated use cases to the table, where the
>    host and/or guest system(s) have multiple ITS pages.

I have done some basic sanity tests with this series and the Qemu branches you
provided on HiSilicon hardware. Basic device assignment works fine. I will
rebase my Qemu smmuv3-accel branch on top of this and do some more tests.

One thing I am not clear about in the above text: do we still plan to support
approach (1) (using RMR in Qemu), or are you just mentioning it here because
it is still possible to make use of it? From previous discussions, I think the
argument was to adopt a more dedicated MSI pass-through model, which I
believe is approach (2) here. Could you please confirm?

Thanks,
Shameer
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 11 months ago
On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:

> One thing I am not clear about in the above text: do we still plan to support
> approach (1) (using RMR in Qemu)

Yes, it remains an option. The VMM would use the
IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
wants to put the RMR region, and then it would send the RMR into the VM
through ACPI.
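
Roughly, the VMM side of that OPTION flow would look like the sketch
below (untested). IOMMU_OPTION and struct iommu_option are existing
iommufd uAPI; IOMMU_OPTION_SW_MSI_START/SIZE are the options added by
this series, and setting them on the whole iommufd context (object_id
== 0) plus the units of the size value are assumptions of the sketch,
not a statement of the final uAPI:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int set_sw_msi_window(int iommufd, uint64_t gpa, uint64_t size)
{
        struct iommu_option opt = {
                .size = sizeof(opt),
                .option_id = IOMMU_OPTION_SW_MSI_START, /* from this series */
                .op = IOMMU_OPTION_OP_SET,
                .object_id = 0,         /* whole iommufd context (assumption) */
                .val64 = gpa,           /* where the guest will see the window */
        };

        if (ioctl(iommufd, IOMMU_OPTION, &opt))
                return -1;

        opt.option_id = IOMMU_OPTION_SW_MSI_SIZE;       /* from this series */
        opt.val64 = size;               /* window length; units per the uAPI */
        return ioctl(iommufd, IOMMU_OPTION, &opt);
}

The same [gpa, gpa + size) range is then what gets described to the
guest, as an RMR here, or as the ITS page in the flow further below.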

The kernel side promises that the RMR region will have a consistent
(but unpredictable!) layout of ITS pages (however many are required)
within that RMR space, regardless of what devices/domain are attached.

I would like to start with patches up to #10 for this part as it
solves two of the three problems here.

> or are you just mentioning it here because
> it is still possible to make use of it? From previous discussions, I think the
> argument was to adopt a more dedicated MSI pass-through model, which I
> believe is approach (2) here.

The basic flow of the pass-through model is shown in the last two
patches; it is not fully complete but is testable. It assumes a single
ITS page. The VMM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
ITS page at the correct S2 location and then describe it in the ACPI
as an ITS page, not an RMR.

The VMM will capture the MSI writes and use
VFIO_IRQ_SET_ACTION_PREPARE to convey the guest's S1 translation to
the IRQ subsystem.
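
To illustrate that step (sketch only, not the final uAPI):
VFIO_DEVICE_SET_IRQS, struct vfio_irq_set and VFIO_PCI_MSIX_IRQ_INDEX
are existing VFIO uAPI, VFIO_IRQ_SET_ACTION_PREPARE is what this series
adds, and the VFIO_IRQ_SET_DATA_MSI_IOVA name plus the
one-64-bit-gIOVA-per-vector payload are assumptions made only for this
sketch:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Convey the gIOVA the VMM trapped from a guest MSI-X table write. */
static int vfio_msix_prepare(int device_fd, unsigned int vector,
                             uint64_t giova)
{
        /* struct vfio_irq_set followed by one 64-bit gIOVA payload */
        char buf[sizeof(struct vfio_irq_set) + sizeof(uint64_t)] = { 0 };
        struct vfio_irq_set *set = (struct vfio_irq_set *)buf;

        set->argsz = sizeof(buf);
        set->flags = VFIO_IRQ_SET_ACTION_PREPARE |  /* added by this series */
                     VFIO_IRQ_SET_DATA_MSI_IOVA;    /* hypothetical name */
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = vector;
        set->count = 1;
        memcpy(set->data, &giova, sizeof(giova));

        return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
}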

The missing piece is cleaning up the ITS mapping to allow for
multiple ITS pages. I've imagined that kvm would somehow give iommufd
an FD that holds the specific ITS pages instead of the
IOMMU_OPTION_SW_MSI_START/SIZE flow.

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 10 months, 3 weeks ago
Hi Jason,

On 1/23/25 2:24 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:
>
>> One thing I am not clear about in the above text: do we still plan to support
>> approach (1) (using RMR in Qemu)
> Yes, it remains an option. The VMM would use the
> IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
> wants to put the RMR region then it would send the RMR into the VM
> through ACPI.
>
> The kernel side promises that the RMR region will have a consistent
> (but unpredictable!) layout of ITS pages (however many are required)
> within that RMR space, regardless of what devices/domain are attached.
>
> I would like to start with patches up to #10 for this part as it
> solves two of the three problems here.
>
>> or are you just mentioning it here because
>> it is still possible to make use of it? From previous discussions, I think the
>> argument was to adopt a more dedicated MSI pass-through model, which I
>> believe is approach (2) here.
> The basic flow of the pass-through model is shown in the last two
> patches; it is not fully complete but is testable. It assumes a single
> ITS page. The VMM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> ITS page at the correct S2 location and then describe it in the ACPI
> as an ITS page, not an RMR.
This is a nice-to-have feature but not mandated in the first place, is it?
>
> The VMM will capture the MSI writes and use
> VFIO_IRQ_SET_ACTION_PREPARE to convey the guest's S1 translation to
> the IRQ subsystem.
>
> The missing piece is cleaning up the ITS mapping to allow for
> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> an FD that holds the specific ITS pages instead of the
> IOMMU_OPTION_SW_MSI_START/SIZE flow.
That's what I don't get: at the moment you only pass the gIOVA. With
technique 2, how can you build the nested mapping, i.e.

         S1           S2
gIOVA    ->    gDB    ->    hDB

without passing the full gIOVA/gDB S1 mapping to the host?

Eric


>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 10 months, 3 weeks ago
On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
> >> or are you just mentioning it here because
> >> it is still possible to make use of it? From previous discussions, I think the
> >> argument was to adopt a more dedicated MSI pass-through model, which I
> >> believe is approach (2) here.
> > The basic flow of the pass-through model is shown in the last two
> > patches; it is not fully complete but is testable. It assumes a single
> > ITS page. The VMM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> > ITS page at the correct S2 location and then describe it in the ACPI
> > as an ITS page, not an RMR.

> This is a nice to have feature but not mandated in the first place,
> is it?

Not mandated. It just sort of happens because of the design. IMHO
nothing should use it because there is no way for userspace to
discover how many ITS pages there may be.

> > The missing piece is cleaning up the ITS mapping to allow for
> > multiple ITS pages. I've imagined that kvm would somehow give iommufd
> > an FD that holds the specific ITS pages instead of the
> > IOMMU_OPTION_SW_MSI_START/SIZE flow.

> That's what I don't get: at the moment you only pass the gIOVA. With
> technique 2, how can you build the nested mapping, ie.
> 
>          S1           S2
> gIOVA    ->    gDB    ->    hDB
> 
> without passing the full gIOVA/gDB S1 mapping to the host?

The nested S2 mapping is already set up before the VM boots:

 - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
 - The ACPI tells the VM that the GIC has an ITS page at the S2's
   address (hDB)
 - The VM sets up its S1 with a gIOVA that points to the S2's ITS 
   page (gDB). The S2 already has gDB -> hDB.
 - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
   S2 are populated at this moment.

If you have multiple ITS pages, then the ACPI has to tell the guest GIC
about them: what their gDB addresses are, and which devices use which ITS.
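
For reference, the ACPI piece in play is the MADT "GIC ITS" entry. The
structure below is paraphrased (not verbatim) from Linux's ACPI headers,
which use the kernel's u8/u16/u32/u64 types; the VMM would emit one such
entry per ITS with base_address set to the chosen gDB, while which
devices use which ITS is routed through the IORT and is not shown here:

#include <stdint.h>

/* MADT subtable type 0xF ("GIC ITS"), paraphrased from Linux's headers */
struct acpi_subtable_header {
        uint8_t type;                   /* 0xF for GIC ITS */
        uint8_t length;                 /* 20 */
} __attribute__((packed));

struct acpi_madt_generic_translator {
        struct acpi_subtable_header header;
        uint16_t reserved;
        uint32_t translation_id;        /* GIC ITS ID */
        uint64_t base_address;          /* guest PA of the ITS register frame;
                                           the GITS_TRANSLATER doorbell page
                                           (gDB) sits at +0x10000 inside it */
        uint32_t reserved2;
} __attribute__((packed));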

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 10 months, 3 weeks ago


On 1/29/25 4:04 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
>>>> or are you just mentioning it here because
>>>> it is still possible to make use of it? From previous discussions, I think the
>>>> argument was to adopt a more dedicated MSI pass-through model, which I
>>>> believe is approach (2) here.
>>> The basic flow of the pass-through model is shown in the last two
>>> patches; it is not fully complete but is testable. It assumes a single
>>> ITS page. The VMM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
>>> ITS page at the correct S2 location and then describe it in the ACPI
>>> as an ITS page, not an RMR.
>> This is a nice to have feature but not mandated in the first place,
>> is it?
> Not mandated. It just sort of happens because of the design. IMHO
> nothing should use it because there is no way for userspace to
> discover how many ITS pages there may be.
>
>>> The missing piece is cleaning up the ITS mapping to allow for
>>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>> an FD that holds the specific ITS pages instead of the
>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>> That's what I don't get: at the moment you only pass the gIOVA. With
>> technique 2, how can you build the nested mapping, ie.
>>
>>          S1           S2
>> gIOVA    ->    gDB    ->    hDB
>>
>> without passing the full gIOVA/gDB S1 mapping to the host?
> The nested S2 mapping is already setup before the VM boots:
>
>  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
Ah OK. Your gDB has nothing to do with the actual S1 guest gDB, right?
It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
Is that correct? In
https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
I was passing both the gIOVA and the "true" gDB.

Eric
>  - The ACPI tells the VM that the GIC has an ITS page at the S2's
>    address (hDB)
>  - The VM sets up its S1 with a gIOVA that points to the S2's ITS 
>    page (gDB). The S2 already has gDB -> hDB.
>  - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
>    S2 are populated at this moment.
>
> If you have multiple ITS pages then the ACPI has to tell the guest GIC
> about them, what their gDB address is, and what devices use which ITS.
>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 10 months, 3 weeks ago
On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
> >>> The missing piece is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> >>> an FD that holds the specific ITS pages instead of the
> >>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
> >> That's what I don't get: at the moment you only pass the gIOVA. With
> >> technique 2, how can you build the nested mapping, ie.
> >>
> >>          S1           S2
> >> gIOVA    ->    gDB    ->    hDB
> >>
> >> without passing the full gIOVA/gDB S1 mapping to the host?
> > The nested S2 mapping is already setup before the VM boots:
> >
> >  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
> right?

I'm not totally sure what you mean by gDB? The above diagram suggests
it is the ITS page address in the S2, i.e. the guest physical address of
the ITS.

Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
provide the gDB address as the phys_addr_t msi_addr.

This happens because the GIC driver will have been informed of the ITS
page at the gDB address, and it will use
iommu_dma_prepare_msi(). Exactly the same as bare metal.
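
For reference, the guest-side path is roughly the following, simplified
and trimmed (not verbatim) from drivers/irqchip/irq-gic-v3-its.c:

static u64 its_irq_get_msi_base(struct its_device *its_dev)
{
        /* phys_base came from the MADT GIC ITS entry, i.e. gDB in a VM */
        return its_dev->its->phys_base + GITS_TRANSLATER;
}

static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
                                unsigned int nr_irqs, void *args)
{
        msi_alloc_info_t *info = args;
        struct its_device *its_dev = info->scratchpad[0].dev_data;
        int err;

        /* With a vSMMU-backed DMA domain this reserves the gIOVA -> gDB
         * S1 mapping; iommu_dma_compose_msi_msg() later rewrites the MSI
         * message to use that gIOVA instead of gDB.
         */
        err = iommu_dma_prepare_msi(info->desc, its_irq_get_msi_base(its_dev));
        if (err)
                return err;

        /* ... LPI/event allocation elided ... */
        return 0;
}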

> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
> Is that correct?

Yes, for a single ITS page it will reliably be put at sw_msi_start.
Since the VMM can provide sw_msi_start through the OPTION, the VMM can
place the ITS page where it wants and then program the ACPI to tell
the VM to call iommu_dma_prepare_msi(). (Don't use this flow: it
doesn't work for multi ITS and is for testing only.)

> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
> I was passing both the gIOVA and the "true" gDB.

If I understand this right, it still had the hypervisor dynamically
setting up the S2, whereas here it is pre-set and static?

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 10 months, 2 weeks ago
Hi Jason,


On 1/29/25 9:13 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
>>>>> The missing piece is cleaning up the ITS mapping to allow for
>>>>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>>>> an FD that holds the specific ITS pages instead of the
>>>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>>>> That's what I don't get: at the moment you only pass the gIOVA. With
>>>> technique 2, how can you build the nested mapping, ie.
>>>>
>>>>          S1           S2
>>>> gIOVA    ->    gDB    ->    hDB
>>>>
>>>> without passing the full gIOVA/gDB S1 mapping to the host?
>>> The nested S2 mapping is already setup before the VM boots:
>>>
>>>  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
>> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
>> right?
> I'm not totally sure what you mean by gDB? The above diagram suggests
> it is the ITS page address in the S2? Ie the guest physical address of
> the ITS.
Yes, this is what I meant, i.e. the guest ITS doorbell GPA.
>
> Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
> provide the gDB address as the phys_addr_t msi_addr.
>
> This happens because the GIC driver will have been informed of the ITS
> page at the gDB address, and it will use
> iommu_dma_prepare_msi(). Exactly the same as bare metal.

Understood, this is the standard MSI binding scheme.
>
>> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
>> Is that correct?
> Yes, for a single ITS page it will reliably be put at sw_msi_start.
> Since the VMM can provide sw_msi_start through the OPTION, the VMM can
> place the ITS page where it wants and then program the ACPI to tell
> the VM to call iommu_dma_prepare_msi(). (don't use this flow, it
> doesn't work for multi ITS, for testing only)
OK, so you need to set the host sw_msi_start to the guest doorbell GPA,
which in qemu is currently at
GITS_TRANSLATER = 0x08080000 + 0x10000.
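
In terms of the hypothetical set_sw_msi_window() helper sketched earlier
in the thread, something like the following (the size value and its
units are assumptions):

/* QEMU virt memory map: ITS register frame at 0x08080000 and the
 * GITS_TRANSLATER (doorbell) page at +0x10000, so the guest doorbell
 * GPA is 0x08090000.
 */
#define VIRT_GIC_ITS_BASE   0x08080000ULL
#define ITS_TRANSLATER_OFF  0x10000ULL

set_sw_msi_window(iommufd, VIRT_GIC_ITS_BASE + ITS_TRANSLATER_OFF,
                  0x10000 /* one 64KiB doorbell page; units per the uAPI */);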

In my original integration, I passed pairs of S1 gIOVA/gDB used by the
guest, and this gDB was directly reused for mapping the hDB.

I think I get it now.

Eric
>
>> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
>> I was passing both the gIOVA and the "true" gDB.
> If I understand this right, it still had the hypervisor dynamically
> setting up the S2, here it is pre-set and static?
>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 10 months, 2 weeks ago
On Tue, Feb 04, 2025 at 01:55:01PM +0100, Eric Auger wrote:

> OK so you need to set host sw_msi_start to the guest doorbell GPA which
> is currently set, in qemu, at
> GITS_TRANSLATER 0x08080000 + 0x10000

Yes (but don't do this except for testing)

The challenge that remains is how to build an API to get each ITS page
mapped into the S2 at the right position - ideally statically before
the VM is booted.

Jason