[PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU

Nicolin Chen posted 13 patches 1 year ago
There is a newer version of this series
[PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Nicolin Chen 1 year ago
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
IOMMU is disabled, the MSI address is programmed to the physical location
of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
page is behind the IOMMU, so the MSI address is programmed to an allocated
IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, the IOVA will still be used to
program the MSI address, though the mapping will be in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).

If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
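For reference, the ARM SMMU drivers hardwire that window in their
resv_regions callback; a simplified sketch (headers and the dma-iommu
follow-up call trimmed, not the exact driver code):

/* Simplified sketch of how an ARM SMMU driver reports the SW_MSI window */
#define MSI_IOVA_BASE           0x8000000
#define MSI_IOVA_LENGTH         0x100000

static void arm_smmu_get_resv_regions(struct device *dev,
                                      struct list_head *head)
{
        int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
        struct iommu_resv_region *region;

        /* Reserve a fixed IOVA window that dma-iommu will use for MSIs */
        region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
                                         IOMMU_RESV_SW_MSI, GFP_KERNEL);
        if (!region)
                return;
        list_add_tail(&region->list, head);
}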

So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in
charge of the IOMMU translation (1-stage translation), since the IOVA for
the ITS page is fixed and known by the kernel. However, with a virtual
machine enabling a nested IOMMU translation (2-stage), a guest kernel
directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g.
0xEEEE0000). Then, the host kernel cannot know that guest-level IOVA to
program the MSI address.

There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few
   RMRs (Reserved Memory Regions) in the guest's IORT. The guest kernel
   would then fetch these RMR entries from the IORT and create an
   IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
   Eventually, the mappings would look like:
     IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
   This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by the VMM to the host-level
   GIC driver, so it can program the correct MSI IOVA. Also forward the
   VMM-defined vITS page location (IPA) to the kernel for the stage-2
   mapping. Eventually:
     IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).

It is worth mentioning that when Eric Auger was working on the same topic
with the VFIO iommu uAPI, he implemented approach (2) first and then
switched to approach (1), as suggested by Jean-Philippe to reduce
complexity.

Approach (1) basically resembles the existing VFIO passthrough case that
has a 1-stage mapping for the unmanaged domain, merely shifting the MSI
mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu
case). So, it can reuse the existing IOMMU_RESV_SW_MSI piece, sharing the
same idea of "the VMM leaving everything to the kernel".

Approach (2) is the ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVA(s) and the stage-2 IPAs of the
vITS page(s), which demands close cooperation from the VMM.
 * It also brings more complicated use cases to the table, where the host
   and/or guest has multiple ITS pages.

[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to use directly. There could be different versions of these
functions from different domain owners: existing VFIO passthrough cases
and in-kernel DMA domain cases reuse dma-iommu's version of the sw_msi
functions, while nested translation use cases can have another version
that handles the mappings and msi_msg(s) differently.
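As a reference, this is roughly how an irqchip driver consumes that pair
today (a trimmed-down sketch modeled on the GICv3 ITS driver; its_doorbell
stands in for the real per-device doorbell lookup):

/* Sketch only: prepare at MSI allocation time, compose when writing msg */
static phys_addr_t its_doorbell;   /* GITS_TRANSLATER PA, looked up elsewhere */

static int its_msi_prepare(struct irq_domain *domain, struct device *dev,
                           int nvec, msi_alloc_info_t *info)
{
        /* Map the doorbell PA and stash the resulting IOVA in the msi_desc */
        return iommu_dma_prepare_msi(info->desc, its_doorbell);
}

static void its_irq_compose_msi_msg(struct irq_data *d, struct msi_msg *msg)
{
        msg->address_lo = lower_32_bits(its_doorbell);
        msg->address_hi = upper_32_bits(its_doorbell);
        msg->data = its_get_event_id(d);

        /* Replace the PA in address_hi/lo with the IOVA saved by prepare() */
        iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
}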

To support both approaches, in this series:
 - Get rid of the duplication in the "compose" function
 - Introduce a function pointer for what was previously the "prepare"
   function
 - Allow different domain owners to set their own "sw_msi" implementations
   (see the sketch after this list)
 - Implement an iommufd_sw_msi function to additionally support the nested
   translation use cases, including the RMR solution (approach 1)
 - Add a pair of IOMMUFD options for a SW_MSI window for the kernel and
   VMM to agree on (for approach 1)
 - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
   to update the msi_desc structure accordingly (for approach 2)
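A rough sketch of the resulting shape (the exact structure layout and
locking live in the patches and may differ):

/* Rough sketch: the domain owner hangs its own sw_msi handler off the
 * iommu_domain, and the generic prepare path simply dispatches to it. */
struct iommu_domain {
        /* ... */
        int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
                      phys_addr_t msi_addr);
};

int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
{
        struct device *dev = msi_desc_to_dev(desc);
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

        if (!domain || !domain->sw_msi)
                return 0;       /* MSIs are not translated by the IOMMU */

        /* dma-iommu's version, or iommufd's own for the nested cases */
        return domain->sw_msi(domain, desc, msi_addr);
}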

A missing piece:
 - Potentially another IOMMUFD_CMD_IOAS_MAP_MSI ioctl for the VMM to map
   the IPAs of the vITS page(s) in the stage-2 IO page table (for approach
   2). (In this RFC, the new IOMMUFD SW_MSI options are conveniently
   reused to set the vITS page's IPA, which works fine in a
   single-vITS-page case.)

This is a joint effort: it includes Jason's rework at the irq/iommu/iommufd
base level and my additional patches on top of that for the new uAPIs.

This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi-rfcv2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-rmr
Pairing QEMU branch for testing (approach 2):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-vits

Changelog
v2
 * Rebase on v6.13-rc6
 * Drop all the irq/pci patches and rework the compose function instead
 * Add a new sw_msi op to iommu_domain for a per-type implementation and
   let the iommufd core have its own implementation to support both
   approaches
 * Add RMR-solution (approach 1) support, since it is straightforward and
   has been widely used in some out-of-tree projects
v1
 https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (5):
  genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
    iommu_cookie
  genirq/msi: Rename iommu_dma_compose_msi_msg() to
    msi_msg_set_msi_addr()
  iommu: Make iommu_dma_prepare_msi() into a generic operation
  irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that
    need it
  iommufd: Implement sw_msi support natively

Nicolin Chen (8):
  iommu: Turn fault_data to iommufd private pointer
  iommufd: Make attach_handle generic
  iommu: Turn iova_cookie to dma-iommu private pointer
  iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
  iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
  iommufd/device: Allow setting IOVAs for MSI(x) vectors
  vfio-iommufd: Provide another layer of msi_iova helpers
  vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE

 drivers/iommu/Kconfig                         |   1 -
 drivers/irqchip/Kconfig                       |   4 +
 kernel/irq/Kconfig                            |   1 +
 drivers/iommu/iommufd/iommufd_private.h       |  69 ++--
 include/linux/iommu.h                         |  58 ++--
 include/linux/iommufd.h                       |   6 +
 include/linux/msi.h                           |  43 ++-
 include/linux/vfio.h                          |  25 ++
 include/uapi/linux/iommufd.h                  |  18 +-
 include/uapi/linux/vfio.h                     |   8 +-
 drivers/iommu/dma-iommu.c                     |  63 ++--
 drivers/iommu/iommu.c                         |  29 ++
 drivers/iommu/iommufd/device.c                | 312 ++++++++++++++++--
 drivers/iommu/iommufd/fault.c                 | 122 +------
 drivers/iommu/iommufd/hw_pagetable.c          |   5 +-
 drivers/iommu/iommufd/io_pagetable.c          |   4 +-
 drivers/iommu/iommufd/ioas.c                  |  34 ++
 drivers/iommu/iommufd/main.c                  |  15 +
 drivers/irqchip/irq-gic-v2m.c                 |   5 +-
 drivers/irqchip/irq-gic-v3-its.c              |  13 +-
 drivers/irqchip/irq-gic-v3-mbi.c              |  12 +-
 drivers/irqchip/irq-ls-scfg-msi.c             |   5 +-
 drivers/vfio/iommufd.c                        |  27 ++
 drivers/vfio/pci/vfio_pci_intrs.c             |  46 +++
 drivers/vfio/vfio_main.c                      |   3 +
 tools/testing/selftests/iommu/iommufd.c       |  53 +++
 .../selftests/iommu/iommufd_fail_nth.c        |  14 +
 27 files changed, 712 insertions(+), 283 deletions(-)

-- 
2.43.0
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 12 months ago
On Fri, Jan 10, 2025 at 07:32:16PM -0800, Nicolin Chen wrote:
> Though these two approaches feel very different on the surface, they can
> share some underlying common infrastructure. Currently, only one pair of
> sw_msi functions (prepare/compose) are provided by dma-iommu for irqchip
> drivers to directly use. There could be different versions of functions
> from different domain owners: for existing VFIO passthrough cases and in-
> kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
> functions; for nested translation use cases, there can be another version
> of sw_msi functions to handle mapping and msi_msg(s) differently.
> 
> To support both approaches, in this series
>  - Get rid of the duplication in the "compose" function
>  - Introduce a function pointer for the previously "prepare" function
>  - Allow different domain owners to set their own "sw_msi" implementations
>  - Implement an iommufd_sw_msi function to additionally support a nested
>    translation use case using the approach (2), i.e. the RMR solution
>  - Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
>    agree on (for approach 1)
>  - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
>    to update the msi_desc structure accordingly (for approach 2)

Thomas/Marc/Robin, are we comfortable with this general approach?
Nicolin can send something non-RFC for a proper review.

I like it, it solves many of the problems iommufd had here and it
seems logical from the irq side.

Thanks,
Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Thomas Gleixner 12 months ago
On Fri, Feb 07 2025 at 10:34, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 07:32:16PM -0800, Nicolin Chen wrote:
>> Though these two approaches feel very different on the surface, they can
>> share some underlying common infrastructure. Currently, only one pair of
>> sw_msi functions (prepare/compose) are provided by dma-iommu for irqchip
>> drivers to directly use. There could be different versions of functions
>> from different domain owners: for existing VFIO passthrough cases and in-
>> kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
>> functions; for nested translation use cases, there can be another version
>> of sw_msi functions to handle mapping and msi_msg(s) differently.
>> 
>> To support both approaches, in this series
>>  - Get rid of the duplication in the "compose" function
>>  - Introduce a function pointer for the previously "prepare" function
>>  - Allow different domain owners to set their own "sw_msi" implementations
>>  - Implement an iommufd_sw_msi function to additionally support a nested
>>    translation use case using the approach (2), i.e. the RMR solution
>>  - Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
>>    agree on (for approach 1)
>>  - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
>>    to update the msi_desc structure accordingly (for approach 2)
>
> Thomas/Marc/Robin, are we comfortable with this general approach?
> Nicolin can send something non-RFC for a proper review.
>
> I like it, it solves many of the problems iommufd had here and it
> seems logical from the irq side.

I haven't seen anything horrible. My main concern about having a proper
cached and writeable message is addressed.

Thanks,

        tglx
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jacob Pan 1 year ago
Hi Nicolin,

On Fri, 10 Jan 2025 19:32:16 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:

> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage
> translation is enabled, IOVA will be still used to program the MSI
> address, though the mappings will be in two stages: IOVA (0xFFFF0000)
> ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for
> Intermediate Physical Address).
> 
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
> 
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by kernel. However, with virtual
> machine enabling a nested IOMMU translation (2-stage), a guest kernel
> directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
> mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space
> (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level
> IOVA to program the MSI address.
> 
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few
> RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> and VMM to agree on the IPA.

Should this RMR be in a separate range from MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no?
e.g. # cat
/sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
0x0000000008000000 0x00000000080fffff msi
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Nicolin Chen 1 year ago
On Wed, Feb 05, 2025 at 02:49:04PM -0800, Jacob Pan wrote:
> > There have been two approaches to solve this problem:
> > 1. Create an identity mapping in the stage-1. VMM could insert a few
> > RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> > would fetch these RMR entries from the IORT and create an
> > IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> > Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> > (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> > and VMM to agree on the IPA.
> 
> Should this RMR be in a separate range than MSI_IOVA_BASE? The guest
> will have MSI_IOVA_BASE in a reserved region already, no?
> e.g. # cat
> /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
> 0x0000000008000000 0x00000000080fffff msi

No. In Patch-9, the driver-defined MSI_IOVA_BASE will be ignored if
userspace has assigned IOMMU_OPTION_SW_MSI_START/SIZE, even if they
might have the same values as the MSI_IOVA_BASE window.

The idea of MSI_IOVA_BASE in this series is a kernel default that is
only effective when user space doesn't care to set anything.
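For clarity, the precedence is roughly as sketched below (illustrative
only, with a hypothetical window struct, not the patch code):

struct sw_msi_window {  /* hypothetical, for illustration only */
        unsigned long user_start, user_size;  /* from IOMMU_OPTION_SW_MSI_* */
        unsigned long start, size;            /* what gets used */
};

static void resolve_sw_msi_window(struct sw_msi_window *w,
                                  unsigned long resv_start,
                                  unsigned long resv_size)
{
        if (w->user_size) {
                /* Userspace set the OPTION: the driver default is ignored */
                w->start = w->user_start;
                w->size = w->user_size;
        } else {
                /* Kernel default, e.g. MSI_IOVA_BASE from IOMMU_RESV_SW_MSI */
                w->start = resv_start;
                w->size = resv_size;
        }
}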

Thanks
Nicolin
RE: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Shameerali Kolothum Thodi 1 year ago
Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
> 
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is translated
> by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
> IOMMU is disabled, the MSI address is programmed to the physical location
> of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
> page is behind the IOMMU, so the MSI address is programmed to an allocated
> IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
> the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
> When a 2-stage translation is enabled, IOVA will be still used to program
> the MSI address, though the mappings will be in two stages:
>   IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
> (IPA stands for Intermediate Physical Address).
> 
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
> IOVA is dynamically allocated from the top of the IOVA space. If attached
> to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
> fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
> which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
> 
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge
> of the IOMMU translation (1-stage translation), since the IOVA for the ITS
> page is fixed and known by kernel. However, with virtual machine enabling
> a nested IOMMU translation (2-stage), a guest kernel directly controls the
> stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an
> IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
> kernel can't know that guest-level IOVA to program the MSI address.
> 
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few RMRs
>    (Reserved Memory Regions) in guest's IORT. Then the guest kernel would
>    fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT
>    region per iommu group for a direct mapping. Eventually, the mappings
>    would look like: IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
>    This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
> 2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
>    driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
>    page location (IPA) to the kernel for the stage-2 mapping. Eventually:
>    IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
>    This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
> 
> Worth mentioning that when Eric Auger was working on the same topic with
> the VFIO iommu uAPI, he had the approach (2) first, and then switched to
> the approach (1), suggested by Jean-Philippe for reduction of complexity.
> 
> The approach (1) basically feels like the existing VFIO passthrough that
> has a 1-stage mapping for the unmanaged domain, yet only by shifting the
> MSI mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-
> iommu case). So, it could reuse the existing IOMMU_RESV_SW_MSI piece, by
> sharing the same idea of "VMM leaving everything to the kernel".
> 
> The approach (2) is an ideal solution, yet it requires additional effort
> for kernel to be aware of the 1-stage gIOVA(s) and 2-stage IPAs for vITS
> page(s), which demands VMM to closely cooperate.
>  * It also brings some complicated use cases to the table where the host
>    or/and guest system(s) has/have multiple ITS pages.

I have done some basic sanity tests with this series and the QEMU branches
you provided on HiSilicon hardware. Basic device assignment works fine. I
will rebase my QEMU smmuv3-accel branch on top of this and do some more
tests.

One confusion I have about the above text: do we still plan to support
approach 1 (using RMR in QEMU), or are you just mentioning it here because
it is still possible to make use of it? I think from previous discussions
the argument was to adopt a more dedicated MSI pass-through model, which I
think is approach 2 here. Could you please confirm?

Thanks,
Shameer
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 1 year ago
On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:

> One confusion I have about the above text is, do we still plan to support the
> approach -1( Using RMR in Qemu)

Yes, it remains an option. The VMM would use the
IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
wants to put the RMR region, and then it would send the RMR into the VM
through ACPI.
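For illustration, the VMM side could look roughly like the sketch below.
The IOMMU_OPTION ioctl and struct iommu_option already exist; the
SW_MSI_START/SIZE option IDs are new in this series, and both the object
the option binds to (IOAS vs. global) and the unit of the size value are
assumptions here, so check the patches for the final uAPI:

/* Hedged sketch: VMM tells the kernel where to place the SW_MSI window */
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int set_sw_msi_window(int iommufd, __u32 object_id,
                             __u64 start, __u64 size)
{
        struct iommu_option opt = {
                .size = sizeof(opt),
                .option_id = IOMMU_OPTION_SW_MSI_START, /* new in this series */
                .op = IOMMU_OPTION_OP_SET,
                .object_id = object_id, /* assumed: the IOAS (or 0 for global) */
                .val64 = start,
        };

        if (ioctl(iommufd, IOMMU_OPTION, &opt))
                return -1;

        opt.option_id = IOMMU_OPTION_SW_MSI_SIZE;       /* new in this series */
        opt.val64 = size;       /* unit per the patches; treated as opaque here */
        return ioctl(iommufd, IOMMU_OPTION, &opt);
}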

The kernel side promises that the RMR region will have a consistent
(but unpredictable!) layout of ITS pages (however many are required)
within that RMR space, regardless of what devices/domain are attached.

I would like to start with patches up to #10 for this part as it
solves two of the three problems here.

> or you are just mentioning it here because
> it is still possible to make use of that. I think from previous discussions the
> argument was to adopt a more dedicated MSI pass-through model which I
> think is  approach-2 here.  

The basic flow of the pass-through model is shown in the last two
patches; it is not fully complete but is testable. It assumes a single
ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
ITS page at the correct S2 location and then describe it in the ACPI
as an ITS page, not an RMR.

The VMM will capture the MSI writes and use
VFIO_IRQ_SET_ACTION_PREPARE to convey the guest's S1 translation to
the IRQ subsystem.
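To make that concrete, the VMM-side call could look roughly like this.
VFIO_DEVICE_SET_IRQS and struct vfio_irq_set are existing uAPI, but
VFIO_IRQ_SET_ACTION_PREPARE and its payload come from this series; the
one-__u64-gIOVA-per-vector layout below is an assumption, not the final
format:

/* Hedged sketch: convey the guest's per-vector MSI gIOVAs to the host */
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_prepare_msix_giovas(int device_fd, __u32 start, __u32 count,
                                    const __u64 *giovas)
{
        size_t sz = sizeof(struct vfio_irq_set) + count * sizeof(__u64);
        struct vfio_irq_set *set = calloc(1, sz);
        int ret;

        if (!set)
                return -1;
        set->argsz = sz;
        /* A DATA_* flag may also be required; see the RFC patches */
        set->flags = VFIO_IRQ_SET_ACTION_PREPARE;       /* new in this series */
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = start;
        set->count = count;
        memcpy(set->data, giovas, count * sizeof(__u64));  /* assumed layout */

        ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
}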

The missing piece is cleaning up the ITS mapping to allow for
multiple ITS pages. I've imagined that KVM would somehow give iommufd
an FD that holds the specific ITS pages instead of the
IOMMU_OPTION_SW_MSI_START/SIZE flow.

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 1 year ago
Hi Jason,

On 1/23/25 2:24 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:
>
>> One confusion I have about the above text is, do we still plan to support the
>> approach -1( Using RMR in Qemu)
> Yes, it remains an option. The VMM would use the
> IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
> wants to put the RMR region then it would send the RMR into the VM
> through ACPI.
>
> The kernel side promises that the RMR region will have a consistent
> (but unpredictable!) layout of ITS pages (however many are required)
> within that RMR space, regardless of what devices/domain are attached.
>
> I would like to start with patches up to #10 for this part as it
> solves two of the three problems here.
>
>> or you are just mentioning it here because
>> it is still possible to make use of that. I think from previous discussions the
>> argument was to adopt a more dedicated MSI pass-through model which I
>> think is  approach-2 here.  
> The basic flow of the pass through model is shown in the last two
> patches, it is not fully complete but is testable. It assumes a single
> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> ITS page at the correct S2 location and then describe it in the ACPI
> as an ITS page not a RMR.
This is a nice-to-have feature but not mandated in the first place, is it?
>
> The VMM will capture the MSI writes and use
> VFIO_IRQ_SET_ACTION_PREPARE to convey the guests's S1 translation to
> the IRQ subsystem.
>
> This missing peice is cleaning up the ITS mapping to allow for
> multiple ITS pages. I've imagined that kvm would someone give iommufd
> a FD that holds the specific ITS pages instead of the
> IOMMU_OPTION_SW_MSI_START/SIZE flow.
That's what I don't get: at the moment you only pass the gIOVA. With
technique 2, how can you build the nested mapping, ie.

         S1           S2
gIOVA    ->    gDB    ->    hDB

without passing the full gIOVA/gDB S1 mapping to the host?

Eric


>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 1 year ago
On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
> >> or you are just mentioning it here because
> >> it is still possible to make use of that. I think from previous discussions the
> >> argument was to adopt a more dedicated MSI pass-through model which I
> >> think is  approach-2 here.  
> > The basic flow of the pass through model is shown in the last two
> > patches, it is not fully complete but is testable. It assumes a single
> > ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> > ITS page at the correct S2 location and then describe it in the ACPI
> > as an ITS page not a RMR.

> This is a nice to have feature but not mandated in the first place,
> is it?

Not mandated. It just sort of happens because of the design. IMHO
nothing should use it because there is no way for userspace to
discover how many ITS pages there may be.

> > This missing peice is cleaning up the ITS mapping to allow for
> > multiple ITS pages. I've imagined that kvm would someone give iommufd
> > a FD that holds the specific ITS pages instead of the
> > IOMMU_OPTION_SW_MSI_START/SIZE flow.

> That's what I don't get: at the moment you only pass the gIOVA. With
> technique 2, how can you build the nested mapping, ie.
> 
>          S1           S2
> gIOVA    ->    gDB    ->    hDB
> 
> without passing the full gIOVA/gDB S1 mapping to the host?

The nested S2 mapping is already set up before the VM boots:

 - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
 - The ACPI tells the VM that the GIC has an ITS page at that S2
   address (gDB)
 - The VM sets up its S1 with a gIOVA that points to the S2's ITS 
   page (gDB). The S2 already has gDB -> hDB.
 - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
   S2 are populated at this moment.

If you have multiple ITS pages, then the ACPI has to tell the guest GIC
about them, what their gDB addresses are, and which devices use which ITS.

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 1 year ago


On 1/29/25 4:04 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
>>>> or you are just mentioning it here because
>>>> it is still possible to make use of that. I think from previous discussions the
>>>> argument was to adopt a more dedicated MSI pass-through model which I
>>>> think is  approach-2 here.  
>>> The basic flow of the pass through model is shown in the last two
>>> patches, it is not fully complete but is testable. It assumes a single
>>> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
>>> ITS page at the correct S2 location and then describe it in the ACPI
>>> as an ITS page not a RMR.
>> This is a nice to have feature but not mandated in the first place,
>> is it?
> Not mandated. It just sort of happens because of the design. IMHO
> nothing should use it because there is no way for userspace to
> discover how many ITS pages there may be.
>
>>> This missing peice is cleaning up the ITS mapping to allow for
>>> multiple ITS pages. I've imagined that kvm would someone give iommufd
>>> a FD that holds the specific ITS pages instead of the
>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>> That's what I don't get: at the moment you only pass the gIOVA. With
>> technique 2, how can you build the nested mapping, ie.
>>
>>          S1           S2
>> gIOVA    ->    gDB    ->    hDB
>>
>> without passing the full gIOVA/gDB S1 mapping to the host?
> The nested S2 mapping is already setup before the VM boots:
>
>  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
Ah OK. Your gDB has nothing to do with the actual S1 guest gDB, right?
It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
Is that correct? In
https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
I was passing both the gIOVA and the "true" gDB.

Eric
>  - The ACPI tells the VM that the GIC has an ITS page at the S2's
>    address (hDB)
>  - The VM sets up its S1 with a gIOVA that points to the S2's ITS 
>    page (gDB). The S2 already has gDB -> hDB.
>  - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
>    S2 are populated at this moment.
>
> If you have multiple ITS pages then the ACPI has to tell the guest GIC
> about them, what their gDB address is, and what devices use which ITS.
>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 1 year ago
On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
> >>> This missing peice is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would someone give iommufd
> >>> a FD that holds the specific ITS pages instead of the
> >>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
> >> That's what I don't get: at the moment you only pass the gIOVA. With
> >> technique 2, how can you build the nested mapping, ie.
> >>
> >>          S1           S2
> >> gIOVA    ->    gDB    ->    hDB
> >>
> >> without passing the full gIOVA/gDB S1 mapping to the host?
> > The nested S2 mapping is already setup before the VM boots:
> >
> >  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
> right?

I'm not totally sure what you mean by gDB? The above diagram suggests
it is the ITS page address in the S2, i.e. the guest physical address of
the ITS?

Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
provide the gDB address as the phys_addr_t msi_addr.

This happens because the GIC driver will have been informed of the ITS
page at the gDB address, and it will use
iommu_dma_prepare_msi(). Exactly the same as bare metal.

> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
> Is that correct?

Yes, for a single ITS page it will reliably be put at sw_msi_start.
Since the VMM can provide sw_msi_start through the OPTION, the VMM can
place the ITS page where it wants and then program the ACPI to tell
the VM to call iommu_dma_prepare_msi(). (Don't use this flow; it
doesn't work for multiple ITS pages and is for testing only.)
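To illustrate the idea (not the literal iommufd_sw_msi_get_map() code),
each distinct physical ITS page gets a stable, page-sized slot inside the
SW_MSI window:

/* Illustrative sketch: slot allocation per distinct ITS page */
struct sw_msi_map {
        struct list_head item;
        phys_addr_t msi_addr;   /* physical ITS page (hDB) */
        unsigned int pgoff;     /* page-sized slot index in the window */
};

static unsigned long sw_msi_slot_iova(struct list_head *maps,
                                      phys_addr_t msi_addr,
                                      unsigned long sw_msi_start)
{
        struct sw_msi_map *cur;
        unsigned int next = 0;

        list_for_each_entry(cur, maps, item) {
                if (cur->msi_addr == msi_addr)  /* already has a slot */
                        return sw_msi_start + cur->pgoff * PAGE_SIZE;
                next = max(next, cur->pgoff + 1);
        }
        /* New ITS page: would take the next slot (list insertion omitted) */
        return sw_msi_start + next * PAGE_SIZE;
}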

> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
> I was passing both the gIOVA and the "true" gDB Eric

If I understand this right, it still had the hypervisor dynamically
setting up the S2, whereas here it is pre-set and static?

Jason
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Eric Auger 1 year ago
Hi Jason,


On 1/29/25 9:13 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
>>>>> This missing peice is cleaning up the ITS mapping to allow for
>>>>> multiple ITS pages. I've imagined that kvm would someone give iommufd
>>>>> a FD that holds the specific ITS pages instead of the
>>>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>>>> That's what I don't get: at the moment you only pass the gIOVA. With
>>>> technique 2, how can you build the nested mapping, ie.
>>>>
>>>>          S1           S2
>>>> gIOVA    ->    gDB    ->    hDB
>>>>
>>>> without passing the full gIOVA/gDB S1 mapping to the host?
>>> The nested S2 mapping is already setup before the VM boots:
>>>
>>>  - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
>> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
>> right?
> I'm not totally sure what you mean by gDB? The above diagram suggests
> it is the ITS page address in the S2? Ie the guest physical address of
> the ITS.
Yes, this is what I meant, i.e. the guest ITS doorbell GPA.
>
> Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
> provide the gDB adress as the phys_addr_t msi_addr.
>
> This happens because the GIC driver will have been informed of the ITS
> page at the gDB address, and it will use
> iommu_dma_prepare_msi(). Exactly the same as bare metal.

Understood, this is the standard MSI binding scheme.
>
>> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
>> Is that correct?
> Yes, for a single ITS page it will reliably be put at sw_msi_start.
> Since the VMM can provide sw_msi_start through the OPTION, the VMM can
> place the ITS page where it wants and then program the ACPI to tell
> the VM to call iommu_dma_prepare_msi(). (don't use this flow, it
> doesn't work for multi ITS, for testing only)
OK, so you need to set the host sw_msi_start to the guest doorbell GPA,
which is currently set in QEMU at GITS_TRANSLATER (0x08080000 + 0x10000).

In my original integration, I passed pairs of S1 gIOVA/gDB used by the
guest, and this gDB was directly reused for mapping hDB.

I think I get it now.

Eric
>
>> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
>> I was passing both the gIOVA and the "true" gDB Eric
> If I understand this right, it still had the hypervisor dynamically
> setting up the S2, here it is pre-set and static?
>
> Jason
>
Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
Posted by Jason Gunthorpe 1 year ago
On Tue, Feb 04, 2025 at 01:55:01PM +0100, Eric Auger wrote:

> OK so you need to set host sw_msi_start to the guest doorbell GPA which
> is currently set, in qemu, at
> GITS_TRANSLATER 0x08080000 + 0x10000

Yes (but don't do this except for testing)

The challenge that remains is how to build an API to get each ITS page
mapped into the S2 at the right position - ideally statically before
the VM is booted.

Jason