[PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain

Samiullah Khawaja posted 3 patches 1 month ago
[PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Samiullah Khawaja 1 month ago
Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
scalable mode. That support is missing in legacy mode and for NO_PASID
in scalable mode; this series adds it for both cases.

This is needed by the Live Update IOMMU persistence work to hot-swap the
IOMMU domain after a live update:
https://lore.kernel.org/all/20251202230303.1017519-1-skhawaja@google.com/

The patch adds NO_PASID support in scalable mode by reusing the
existing replace semantics. This works because in scalable mode only
the PASID entries are updated; the context entries are left untouched.

The series also contains a vfio selftest for IOMMU domain replacement,
using the iommufd hwpt replace functionality.

Tested on a host in scalable mode and on QEMU in legacy mode:

tools/testing/selftests/vfio/scripts/setup.sh <dsa_device_bdf>
tools/testing/selftests/vfio/vfio_iommufd_hwpt_replace_test <dsa_device_bdf>

TAP version 13
1..2
# Starting 2 tests from 2 test cases.
#  RUN           vfio_iommufd_replace_hwpt_test.domain_replace.memcpy ...
#            OK  vfio_iommufd_replace_hwpt_test.domain_replace.memcpy
ok 1 vfio_iommufd_replace_hwpt_test.domain_replace.memcpy
#  RUN           vfio_iommufd_replace_hwpt_test.noreplace.memcpy ...
#            OK  vfio_iommufd_replace_hwpt_test.noreplace.memcpy
ok 2 vfio_iommufd_replace_hwpt_test.noreplace.memcpy
# PASSED: 2 / 2 tests passed.
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Samiullah Khawaja (3):
  iommu/vt-d: Allow replacing no_pasid iommu_domain
  vfio: selftests: Add support of creating iommus from iommufd
  vfio: selftests: Add iommufd hwpt replace test

 drivers/iommu/intel/iommu.c                   | 107 +++++++++----
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/lib/include/libvfio/iommu.h          |   2 +
 .../lib/include/libvfio/vfio_pci_device.h     |   2 +
 tools/testing/selftests/vfio/lib/iommu.c      |  60 ++++++-
 .../selftests/vfio/lib/vfio_pci_device.c      |  16 +-
 .../vfio/vfio_iommufd_hwpt_replace_test.c     | 151 ++++++++++++++++++
 7 files changed, 303 insertions(+), 36 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_hwpt_replace_test.c


base-commit: 6cd6c12031130a349a098dbeb19d8c3070d2dfbe
-- 
2.52.0.351.gbe84eed79e-goog
Re: [PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Jason Gunthorpe 1 month ago
On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> scalable mode.

It does? We were just talking about how it doesn't work because it
makes the PASID entry non-present while loading the new domain.

Jason
Re: [PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Jason Gunthorpe 1 month ago
On Wed, Jan 07, 2026 at 04:28:12PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> > Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> > scalable mode.
> 
> It does? We were just talking about how it doesn't work because it
> makes the PASID entry non-present while loading the new domain.

If you tried your tests in scalable mode they are probably only
working because the HW is holding the entry in cache while the CPU is
completely mangling it:

int intel_pasid_replace_first_level(struct intel_iommu *iommu,
				    struct device *dev, phys_addr_t fsptptr,
				    u32 pasid, u16 did, u16 old_did,
				    int flags)
{
[..]
	*pte = new_pte;

That just doesn't work for "replace": it isn't hitless unless the
entry stays in the cache. Since your test effectively holds the
context entry in the cache while testing for "hitless", it doesn't
really test whether it works without races.
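
To illustrate why a plain struct assignment is a problem, here is a
minimal user-space sketch. It assumes the entry is eight 64-bit words
and that the assignment is carried out as a series of separate stores;
struct toy_entry, snapshot_after() and is_torn() are hypothetical names
for this illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_QWORDS 8	/* toy entry modelled as eight 64-bit words */

struct toy_entry {
	uint64_t val[ENTRY_QWORDS];
};

/*
 * Model "*pte = new_pte" as one store per qword and return what a
 * concurrent reader would fetch after 'stores_done' of those stores.
 */
static struct toy_entry snapshot_after(const struct toy_entry *old_e,
				       const struct toy_entry *new_e,
				       int stores_done)
{
	struct toy_entry live = *old_e;

	assert(stores_done >= 0 && stores_done <= ENTRY_QWORDS);
	for (int i = 0; i < stores_done; i++)
		live.val[i] = new_e->val[i];
	return live;
}

/* True if snap is neither the complete old entry nor the complete new one. */
static int is_torn(const struct toy_entry *snap,
		   const struct toy_entry *old_e,
		   const struct toy_entry *new_e)
{
	return memcmp(snap, old_e, sizeof(*snap)) != 0 &&
	       memcmp(snap, new_e, sizeof(*snap)) != 0;
}
```

If the old and new entries differ in more than one qword, any fetch
that lands between the first and last differing store sees a mix of
both, which is the window a cached entry hides during testing.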

All of this needs to be reworked to always use the stack to build the
entry, like the replace path does, and have an ARM-like algorithm to
update the live memory in just the right order to guarantee the HW
does not see a corrupted entry.

It is a little bit tricky, but it should start with reworking
everything to consistently use the stack to create the new entry and
calling a centralized function to set the new entry to the live
memory. This replace/not replace split should be purged completely.
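
As a rough illustration of the idea above, the following user-space
sketch builds the target entry separately and hands it to one central
writer that orders the stores. It is loosely modelled on the three-step
approach used for ARM SMMUv3 entries, not on VT-d code. The entry
layout (present bit, two mode bits, a domain ID field, one table
pointer per mode) and every toy_* name are assumptions made up for the
example.

```c
#include <assert.h>
#include <stdint.h>

#define NQ		4			/* toy entry: four 64-bit words */
#define F_PRESENT	(1ULL << 0)
#define F_MODE_MASK	(3ULL << 1)
#define F_MODE_FS	(1ULL << 1)		/* HW walks val[1] */
#define F_MODE_SS	(2ULL << 1)		/* HW walks val[2] */
#define F_DID_MASK	(0xffffULL << 16)

struct toy_entry { uint64_t val[NQ]; };
struct toy_trace { struct toy_entry seen[3 * NQ]; int n; };

/* Which bits would HW read, given the configuration held in e? */
static void toy_get_used(const struct toy_entry *e, struct toy_entry *used)
{
	*used = (struct toy_entry){ .val = { F_PRESENT } };
	if (!(e->val[0] & F_PRESENT))
		return;
	used->val[0] |= F_MODE_MASK | F_DID_MASK;
	if ((e->val[0] & F_MODE_MASK) == F_MODE_FS)
		used->val[1] = ~0ULL;
	if ((e->val[0] & F_MODE_MASK) == F_MODE_SS)
		used->val[2] = ~0ULL;
}

/* One 64-bit store to live memory, then log what a HW fetch would find. */
static void toy_store(struct toy_entry *live, int i, uint64_t v,
		      struct toy_trace *t)
{
	assert(t->n < 3 * NQ);
	live->val[i] = v;
	t->seen[t->n++] = *live;
}

/* Returns 1 if the update was hitless, 0 if it had to clear present first. */
static int toy_write_entry(struct toy_entry *live,
			   const struct toy_entry *target, struct toy_trace *t)
{
	struct toy_entry cur_used, tgt_used, step1;
	int i, ndiff = 0, crit = -1;

	toy_get_used(live, &cur_used);
	toy_get_used(target, &tgt_used);
	for (i = 0; i < NQ; i++) {
		/* The target must not set bits its own config ignores. */
		assert((target->val[i] & ~tgt_used.val[i]) == 0);
		/* Keep every bit HW reads today; take target bits elsewhere. */
		step1.val[i] = (live->val[i] & cur_used.val[i]) |
			       (target->val[i] & ~cur_used.val[i]);
		if ((step1.val[i] & tgt_used.val[i]) != target->val[i]) {
			ndiff++;
			crit = i;
		}
	}
	if (ndiff > 1) {			/* break-before-make */
		toy_store(live, 0, 0, t);
		for (i = 1; i < NQ; i++)
			toy_store(live, i, target->val[i], t);
		toy_store(live, 0, target->val[0], t);
		return 0;
	}
	for (i = 0; i < NQ; i++)		/* step 1: ignored bits only */
		if (i != crit)
			toy_store(live, i, step1.val[i], t);
	if (crit >= 0)				/* step 2: one store flips config */
		toy_store(live, crit, target->val[crit], t);
	for (i = 0; i < NQ; i++)		/* step 3: clear stale leftovers */
		toy_store(live, i, target->val[i], t);
	return 1;
}

/* Do a and b look identical once masked by the bits their config uses? */
static int toy_same_view(const struct toy_entry *a, const struct toy_entry *b)
{
	struct toy_entry ua, ub;

	toy_get_used(a, &ua);
	toy_get_used(b, &ub);
	for (int i = 0; i < NQ; i++)
		if ((a->val[i] & ua.val[i]) != (b->val[i] & ub.val[i]))
			return 0;
	return 1;
}
```

In this toy layout, switching mode and domain ID together is hitless
because both live in the same qword, while changing the domain ID and
the active table pointer in one go touches two used qwords and falls
back to clearing the present bit. A real writer would also need
single-copy-atomic 64-bit stores and cache invalidation between the
steps, both of which the sketch leaves out.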

Some discussion is here

https://lore.kernel.org/all/20260106142301.GS125261@ziepe.ca/

It also needs to be very careful that the invalidation covers both the
old and the new context entry while the entry is being replaced.

For instance the placement of cache_tag_assign_domain() looks wrong to
me, it can't be *after* the HW has been programmed to use the new tags
:\
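
The ordering concern can be sketched with a small user-space model. It
assumes a per-device set of tracked domain IDs that invalidation walks,
and a set of IDs the HW may still hold cached translations for; all
names here (toy_state, toy_replace() and the ids_* helpers) are
hypothetical and do not correspond to the driver's cache tag API.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_IDS 4

struct toy_ids { uint16_t id[MAX_IDS]; int n; };

struct toy_state {
	struct toy_ids tags;	/* domain IDs an invalidation would cover */
	struct toy_ids hw;	/* domain IDs HW may have cached entries for */
	int violations;		/* times HW could cache under an untracked ID */
};

static int ids_has(const struct toy_ids *s, uint16_t id)
{
	for (int i = 0; i < s->n; i++)
		if (s->id[i] == id)
			return 1;
	return 0;
}

static void ids_add(struct toy_ids *s, uint16_t id)
{
	if (ids_has(s, id))
		return;
	assert(s->n < MAX_IDS);
	s->id[s->n++] = id;
}

static void ids_del(struct toy_ids *s, uint16_t id)
{
	for (int i = 0; i < s->n; i++) {
		if (s->id[i] == id) {
			s->id[i] = s->id[s->n - 1];	/* swap-remove */
			s->n--;
			return;
		}
	}
}

/* Invariant: every ID the HW may be using must be tracked for flushing. */
static void toy_check(struct toy_state *st)
{
	for (int i = 0; i < st->hw.n; i++)
		if (!ids_has(&st->tags, st->hw.id[i]))
			st->violations++;
}

/* Replace old_id with new_id; assign_first selects the tag ordering. */
static int toy_replace(struct toy_state *st, uint16_t old_id,
		       uint16_t new_id, int assign_first)
{
	if (assign_first) {
		ids_add(&st->tags, new_id);	/* track new ID before HW sees it */
		toy_check(st);
	}
	ids_add(&st->hw, new_id);		/* entry written: old or new in use */
	toy_check(st);
	if (!assign_first) {
		ids_add(&st->tags, new_id);	/* too late: window already opened */
		toy_check(st);
	}
	ids_del(&st->hw, old_id);		/* old ID flushed from HW caches */
	toy_check(st);
	ids_del(&st->tags, old_id);		/* only now stop tracking old ID */
	toy_check(st);
	return st->violations;
}
```

With assign_first set, the invariant holds at every step. With the tag
assigned only after the entry is written, there is one step where the
HW may cache translations under an ID that no invalidation would reach.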

I also didn't note where the currently active cache_tag is removed
from the linked list during attach, is that another bug?

In short, this needs a lot of work to actually properly implement
hitless replace the way ARM can. Fortunately I think it is mostly
mechanical and should be fairly straightforward. Refer to the ARM
driver and try to structure vtd to have the same essential flow.

Jason
Re: [PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Samiullah Khawaja 4 weeks ago
On Wed, Jan 7, 2026 at 12:46 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jan 07, 2026 at 04:28:12PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> > > Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> > > scalable mode.
> >
> > It does? We were just talking about how it doesn't work because it
> > makes the PASID entry non-present while loading the new domain.
>
> If you tried your tests in scalable mode they are probably only
> working because the HW is holding the entry in cache while the CPU is
> completely mangling it:
>
> int intel_pasid_replace_first_level(struct intel_iommu *iommu,
>                                     struct device *dev, phys_addr_t fsptptr,
>                                     u32 pasid, u16 did, u16 old_did,
>                                     int flags)
> {
> [..]
>         *pte = new_pte;
>
> That just doesn't work for "replace": it isn't hitless unless the
> entry stays in the cache. Since your test effectively holds the
> context entry in the cache while testing for "hitless", it doesn't
> really test whether it works without races.

Ah.. you are absolutely correct. This will not work if the entries are
not cached.
>
> All of this needs to be reworked to always use the stack to build the
> entry, like the replace path does, and have an ARM-like algorithm to
> update the live memory in just the right order to guarantee the HW
> does not see a corrupted entry.

Agreed. I will go through the VTD specs and also the ARM driver to
determine the right order to set this up.
>
> It is a little bit tricky, but it should start with reworking
> everything to consistently use the stack to create the new entry and
> calling a centralized function to set the new entry to the live
> memory. This replace/not replace split should be purged completely.
>
> Some discussion is here
>
> https://lore.kernel.org/all/20260106142301.GS125261@ziepe.ca/
>
> It also needs to be very careful that the invalidation covers both the
> old and the new context entry while the entry is being replaced.

Yes, let me follow the discussion over there closely.
>
> For instance the placement of cache_tag_assign_domain() looks wrong to
> me, it can't be *after* the HW has been programmed to use the new tags
> :\
>
> I also didn't note where the currently active cache_tag is removed
> from the linked list during attach, is that another bug?

You are correct, the placement of cache_tag_assign_domain() is wrong.
The removal of the currently active cache tag is done in
dmar_domain_attach_device(), but I will re-evaluate the placement of
both in v2.
>
> In short, this needs a lot of work to actually properly implement
> hitless replace the way ARM can. Fortunately I think it is mostly
> mechanical and should be fairly straightforward. Refer to the ARM
> driver and try to structure vtd to have the same essential flow.

Thank you for the feedback. I will prepare a v2 series addressing these points.
>
> Jason
Re: [PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Jason Gunthorpe 3 weeks, 5 days ago
> Thank you for the feedback. I will prepare a v2 series addressing these points.

I think there are so many problems here you should talk to Kevin and
Baolu to come up with some plan. A single series is not going to be
able to do all of this.

Jason
Re: [PATCH 0/3] iommu/vt-d: Add support to hitless replace IOMMU domain
Posted by Baolu Lu 3 weeks, 5 days ago
On 1/12/26 06:14, Jason Gunthorpe wrote:
>> Thank you for the feedback. I will prepare a v2 series addressing these points.
> I think there are so many problems here you should talk to Kevin and
> Baolu to come up with some plan. A single series is not going to be
> able to do all of this.

Yes, absolutely. I am working on a patch series to address the
fundamental atomicity issue. I hope to post it for discussion soon. Once
that is addressed, we can discuss any additional problems.

Thanks,
baolu