This series introduces a new vIOMMU infrastructure and related ioctls.

IOMMUFD has been using the HWPT infrastructure for all cases, including
nested IO page table support. Yet, there are limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable) with the HWPT infrastructure alone: a
parent HWPT typically holds one stage-2 IO pagetable and tags it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other words, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available.

For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| |
| [5] |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | | | |
| | | | |
| | [1] | | [4] [2] |
| | ______ | | _____________ ________ |
| | | | | [3] | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platform IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
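
As a rough sketch of what such a slice can look like on the driver side
(the names below are illustrative placeholders, not the exact definitions
added by this series), a driver embeds the core vIOMMU object in its own
per-instance structure so the slice can carry state like a guest-owned
cache tag for one physical IOMMU instance:

    /*
     * Hypothetical sketch only -- field and type names are illustrative,
     * not the exact ones in this series.
     */
    struct my_vsmmu {
    	struct iommufd_viommu core;	/* core object, allocated/freed by iommufd */
    	struct arm_smmu_device *smmu;	/* physical IOMMU instance backing this slice */
    	u16 vmid;			/* guest-owned ID tagging this instance's S2 cache entries */
    };

A nested domain allocated on such a slice can then reference the slice
(rather than the raw S2 parent) to pick up the per-instance ID.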
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that are passed to (via devices) a guest VM, while
being able to hold the shareable parent HWPT. Each vIOMMU then just needs
to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add an IOMMUFD_CMD_VIOMMU_ALLOC ioctl for allocation
only, and implement it in the arm-smmu-v3 driver as a real-world use case.
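
For a rough idea of the userspace flow, here is a hedged sketch (not code
from this series): it assumes the uAPI structs proposed here plus the
SMMUv3 nesting data type from the prerequisite nesting series, and the
"dev_id"/"ioas_id" values are placeholders from the usual device/IOAS
setup. One nesting parent HWPT is allocated, wrapped by a per-physical-
IOMMU vIOMMU, and a nested HWPT is then allocated with pt_id carrying the
viommu_id:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    /* Allocate: nesting parent HWPT -> vIOMMU slice -> nested HWPT on the vIOMMU */
    static uint32_t viommu_flow(int iommufd, uint32_t dev_id, uint32_t ioas_id)
    {
    	struct iommu_hwpt_alloc parent = {
    		.size = sizeof(parent),
    		.flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
    		.dev_id = dev_id,
    		.pt_id = ioas_id,			/* stage-2 backed by the IOAS */
    	};
    	struct iommu_viommu_alloc viommu = {
    		.size = sizeof(viommu),
    		.type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
    		.dev_id = dev_id,			/* selects the physical SMMU instance */
    	};
    	struct iommu_hwpt_arm_smmuv3 ste = {};		/* guest STE words (placeholder) */
    	struct iommu_hwpt_alloc nested = {
    		.size = sizeof(nested),
    		.dev_id = dev_id,
    		.data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
    		.data_len = sizeof(ste),
    		.data_uptr = (uintptr_t)&ste,
    	};

    	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &parent))
    		return 0;
    	viommu.hwpt_id = parent.out_hwpt_id;		/* the shareable parent HWPT */
    	if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &viommu))
    		return 0;
    	nested.pt_id = viommu.out_viommu_id;		/* pt_id carries a viommu_id */
    	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &nested))
    		return 0;
    	return nested.out_hwpt_id;
    }

A second vIOMMU for another physical IOMMU instance would reuse the same
parent.out_hwpt_id, which is what the sharing diagram above illustrates.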
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. The vIOMMU
object is repurposed from an earlier RFC; just for a reference:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4
(a pairing QEMU branch for testing will be provided with the part-2 series)
Changelog
v4
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvidia.com/
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU vs. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (11):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Add iommufd_verify_unfinalized_object
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add domain_alloc_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++---
drivers/iommu/iommufd/iommufd_private.h | 36 ++------
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +++
include/linux/iommufd.h | 89 +++++++++++++++++++
include/uapi/linux/iommufd.h | 56 ++++++++++--
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/iommufd/driver.c | 38 ++++++++
drivers/iommu/iommufd/hw_pagetable.c | 69 +++++++++++++-
drivers/iommu/iommufd/main.c | 58 ++++++------
drivers/iommu/iommufd/selftest.c | 73 +++++++++++++--
drivers/iommu/iommufd/viommu.c | 85 ++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 78 ++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +++
Documentation/userspace-api/iommufd.rst | 69 +++++++++++++-
18 files changed, 701 insertions(+), 124 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, October 22, 2024 8:19 AM
>
> This series introduces a new vIOMMU infrastructure and related ioctls.
>
> IOMMUFD has been using the HWPT infrastructure for all cases, including
> nested IO page table support. Yet, there are limitations for an HWPT-based
> structure to support some advanced HW-accelerated features, such as CMDQV
> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> environment, it is not straightforward for nested HWPTs to share the same
> parent HWPT (stage-2 IO pagetable) with the HWPT infrastructure alone: a
> parent HWPT typically holds one stage-2 IO pagetable and tags it with only
> one ID in the cache entries. When sharing one large stage-2 IO pagetable
> across physical IOMMU instances, that one ID may not always be available
> across all the IOMMU instances. In other words, it's ideal for SW to have
> a different container for the stage-2 IO pagetable so it can hold another
> ID that's available.

Just holding multiple IDs doesn't require a different container. This is
just a side effect when vIOMMU will be required for other said reasons.

If we have to put more words here I'd prefer adding a bit more for CMDQV
which is more compelling. Not a big deal though. 😊

> For this "different container", add vIOMMU, an additional layer to hold
> extra virtualization information:
>
> [diagram with [1] ... [5] labels trimmed]

nit - [1] ... [5] can be removed.

> The vIOMMU object should be seen as a slice of a physical IOMMU instance
> that is passed to or shared with a VM. That can be some HW/SW resources:
> - Security namespace for guest owned ID, e.g. guest-controlled cache tags
> - Access to a sharable nesting parent pagetable across physical IOMMUs
> - Virtualization of various platform IDs, e.g. RIDs and others
> - Delivery of paravirtualized invalidation
> - Direct assigned invalidation queues
> - Direct assigned interrupts
> - Non-affiliated event reporting

sorry no idea about 'non-affiliated event'. Can you elaborate?

> On a multi-IOMMU system, the vIOMMU object must be instanced to the number
> of the physical IOMMUs that are passed to (via devices) a guest VM, while

'to the number of the physical IOMMUs that have a slice passed to ...'

> being able to hold the shareable parent HWPT. Each vIOMMU then just needs
> to allocate its own individual ID to tag its own cache:
> [...]
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > In other words, it's ideal for SW to have a different container for the
> > stage-2 IO pagetable so it can hold another ID that's available.
>
> Just holding multiple IDs doesn't require a different container. This is
> just a side effect when vIOMMU will be required for other said reasons.
>
> If we have to put more words here I'd prefer adding a bit more for CMDQV
> which is more compelling. Not a big deal though. 😊

Ack.

> > For this "different container", add vIOMMU, an additional layer to hold
> > extra virtualization information:
> >
> > [diagram with [1] ... [5] labels trimmed]
>
> nit - [1] ... [5] can be removed.

They are copied from the Documentation where numbers are needed. I will
take all the numbers out in the cover-letters.

> > - Non-affiliated event reporting
>
> sorry no idea about 'non-affiliated event'. Can you elaborate?

I'll put an "e.g.".

> > On a multi-IOMMU system, the vIOMMU object must be instanced to the number
> > of the physical IOMMUs that are passed to (via devices) a guest VM, while
>
> 'to the number of the physical IOMMUs that have a slice passed to ...'

Ack.

Thanks
Nicolin
On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > The vIOMMU object should be seen as a slice of a physical IOMMU instance
> > that is passed to or shared with a VM. That can be some HW/SW resources:
> > - Security namespace for guest owned ID, e.g. guest-controlled cache tags
> > - Access to a sharable nesting parent pagetable across physical IOMMUs
> > - Virtualization of various platform IDs, e.g. RIDs and others
> > - Delivery of paravirtualized invalidation
> > - Direct assigned invalidation queues
> > - Direct assigned interrupts
> > - Non-affiliated event reporting
>
> sorry no idea about 'non-affiliated event'. Can you elaborate?

This would be an event that is not connected to a device.

For instance a CMDQ experienced a problem.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, October 25, 2024 11:43 PM
>
> On Fri, Oct 25, 2024 at 08:34:05AM +0000, Tian, Kevin wrote:
> > > The vIOMMU object should be seen as a slice of a physical IOMMU instance
> > > that is passed to or shared with a VM. That can be some HW/SW resources:
> > > - Security namespace for guest owned ID, e.g. guest-controlled cache tags
> > > - Access to a sharable nesting parent pagetable across physical IOMMUs
> > > - Virtualization of various platform IDs, e.g. RIDs and others
> > > - Delivery of paravirtualized invalidation
> > > - Direct assigned invalidation queues
> > > - Direct assigned interrupts
> > > - Non-affiliated event reporting
> >
> > sorry no idea about 'non-affiliated event'. Can you elaborate?
>
> This would be an event that is not connected to a device.
>
> For instance a CMDQ experienced a problem.

Okay, then 'non-device-affiliated' is probably clearer.