Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
scalable mode. The support is not available in legacy mode, nor for
NO_PASID in scalable mode. This patch series adds the support for legacy
mode and for NO_PASID in scalable mode. This is needed for Live Update
IOMMU persistence to hotswap the iommu domain after a live update:

https://lore.kernel.org/all/20251202230303.1017519-1-skhawaja@google.com/

The patch adds support for the scalable mode NO_PASID case by using the
existing replace semantics. This works since in scalable mode the
context entries are not updated; only the pasid entries are updated.

The patch series also contains a vfio selftest for the iommu domain
replace using the iommufd hwpt replace functionality.

Tested on a host with scalable mode and on QEMU with legacy mode:

  tools/testing/selftests/vfio/scripts/setup.sh <dsa_device_bdf>
  tools/testing/selftests/vfio/vfio_iommufd_hwpt_replace_test <dsa_device_bdf>

  TAP version 13
  1..2
  # Starting 2 tests from 2 test cases.
  #  RUN           vfio_iommufd_replace_hwpt_test.domain_replace.memcpy ...
  #            OK  vfio_iommufd_replace_hwpt_test.domain_replace.memcpy
  ok 1 vfio_iommufd_replace_hwpt_test.domain_replace.memcpy
  #  RUN           vfio_iommufd_replace_hwpt_test.noreplace.memcpy ...
  #            OK  vfio_iommufd_replace_hwpt_test.noreplace.memcpy
  ok 2 vfio_iommufd_replace_hwpt_test.noreplace.memcpy
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Samiullah Khawaja (3):
  iommu/vt-d: Allow replacing no_pasid iommu_domain
  vfio: selftests: Add support of creating iommus from iommufd
  vfio: selftests: Add iommufd hwpt replace test

 drivers/iommu/intel/iommu.c                   | 107 +++++++++----
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/lib/include/libvfio/iommu.h          |   2 +
 .../lib/include/libvfio/vfio_pci_device.h     |   2 +
 tools/testing/selftests/vfio/lib/iommu.c      |  60 ++++++-
 .../selftests/vfio/lib/vfio_pci_device.c      |  16 +-
 .../vfio/vfio_iommufd_hwpt_replace_test.c     | 151 ++++++++++++++++++
 7 files changed, 303 insertions(+), 36 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_hwpt_replace_test.c

base-commit: 6cd6c12031130a349a098dbeb19d8c3070d2dfbe
--
2.52.0.351.gbe84eed79e-goog
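For reference, the core replace sequence the new test exercises looks
roughly like the following. This is an illustrative sketch only, not the
selftest source; error handling and the libvfio wrappers are omitted,
and replace_hwpt() is a made-up name:

#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

/*
 * device_fd is a vfio cdev fd already bound to the iommufd (iommufd_fd)
 * via VFIO_DEVICE_BIND_IOMMUFD, which also produced dev_id; ioas_id is
 * an existing IOAS with the test buffer mapped.
 */
static void replace_hwpt(int iommufd_fd, int device_fd, __u32 dev_id,
			 __u32 ioas_id)
{
	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};
	struct vfio_device_attach_iommufd_pt attach = {
		.argsz = sizeof(attach),
	};

	/* first hwpt, initial attach */
	ioctl(iommufd_fd, IOMMU_HWPT_ALLOC, &alloc);
	attach.pt_id = alloc.out_hwpt_id;
	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);

	/* second hwpt backed by the same IOAS */
	alloc.out_hwpt_id = 0;
	ioctl(iommufd_fd, IOMMU_HWPT_ALLOC, &alloc);
	attach.pt_id = alloc.out_hwpt_id;

	/* attaching while already attached replaces the hwpt in place */
	ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}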
On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> scalable mode.

It does? We were just talking about how it doesn't work because it
makes the PASID entry non-present while loading the new domain.

Jason
On Wed, Jan 07, 2026 at 04:28:12PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> > Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> > scalable mode.
>
> It does? We were just talking about how it doesn't work because it
> makes the PASID entry non-present while loading the new domain.
If you tried your tests in scalable mode they are probably only
working because the HW is holding the entry in cache while the CPU is
completely mangling it:
int intel_pasid_replace_first_level(struct intel_iommu *iommu,
                                    struct device *dev, phys_addr_t fsptptr,
                                    u32 pasid, u16 did, u16 old_did,
                                    int flags)
{
[..]
        *pte = new_pte;
That just doesn't work for "replace"; it isn't hitless unless the
entry stays in the cache. Since your test effectively holds the
context entry in the cache while testing for "hitless", it doesn't
really test whether this works without races..
All of this needs to be reworked to always use the stack to build the
entry, like the replace path does, and to have an ARM-like algorithm to
update the live memory in just the right order to guarantee the HW
does not see a corrupted entry.
It is a little bit tricky, but it should start with reworking
everything to consistently use the stack to create the new entry and
calling a centralized function to set the new entry to the live
memory. This replace/not replace split should be purged completely.
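Very roughly, the centralized writer could have this shape (completely
untested sketch, ignoring the hard cases where qw0 itself changes and a
transitional value is needed; vtd_write_pasid_entry() is a made-up
name):

static void vtd_write_pasid_entry(struct pasid_entry *pte,
                                  const struct pasid_entry *ent)
{
        int i;

        /*
         * Sketch only: qw0 carries the present bit, so update the
         * other qwords first and order them before publishing qw0.
         * A non-coherent IOMMU would additionally need a cache flush,
         * and the caller still has to invalidate under both the old
         * and the new DID.
         */
        for (i = 1; i < 8; i++)
                WRITE_ONCE(pte->val[i], ent->val[i]);
        dma_wmb();
        WRITE_ONCE(pte->val[0], ent->val[0]);
}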
Some discussion is here
https://lore.kernel.org/all/20260106142301.GS125261@ziepe.ca/
It also needs to be very careful that the invalidation covers both
the old and the new context entry concurrently while it is being
replaced.

For instance the placement of cache_tag_assign_domain() looks wrong to
me; it can't be *after* the HW has been programmed to use the new tags
:\
I also didn't note where the currently active cache_tag is removed
from the linked list during attach. Is that another bug?
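The attach flow I'd expect is roughly this, in pseudo-code
(flush_old_and_new_dids() is a made-up placeholder for whatever
invalidation ends up being right):

        /* publish the new DID to map/unmap flushers first */
        cache_tag_assign_domain(new_domain, dev, pasid);
        /* switch the HW over, as in the writer sketch above */
        vtd_write_pasid_entry(pte, &ent);
        /* hypothetical: invalidate under both the old and new DIDs */
        flush_old_and_new_dids(iommu, old_did, new_did, pasid);
        /* only now is the old DID dead and its cache_tag removable */
        cache_tag_unassign_domain(old_domain, dev, pasid);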
In short, this needs a lot of work to actually properly implement
hitless replace the way ARM can. Fortunately I think it is mostly
mechanical and should be fairly straightforward. Refer to the ARM
driver and try to structure vtd to have the same essential flow..
Jason
On Wed, Jan 7, 2026 at 12:46 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jan 07, 2026 at 04:28:12PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 07, 2026 at 08:17:57PM +0000, Samiullah Khawaja wrote:
> > > Intel IOMMU Driver already supports replacing IOMMU domain hitlessly in
> > > scalable mode.
> >
> > It does? We were just talking about how it doesn't work because it
> > makes the PASID entry non-present while loading the new domain.
>
> If you tried your tests in scalable mode they are probably only
> working because the HW is holding the entry in cache while the CPU is
> completely mangling it:
>
> int intel_pasid_replace_first_level(struct intel_iommu *iommu,
>                                     struct device *dev, phys_addr_t fsptptr,
>                                     u32 pasid, u16 did, u16 old_did,
>                                     int flags)
> {
> [..]
>         *pte = new_pte;
>
> That just doesn't work for "replace"; it isn't hitless unless the
> entry stays in the cache. Since your test effectively holds the
> context entry in the cache while testing for "hitless", it doesn't
> really test whether this works without races..
Ah.. you are absolutely correct. This will not work if the entries are
not cached.
>
> All of this needs to be reworked to always use the stack to build the
> entry, like the replace path does, and to have an ARM-like algorithm to
> update the live memory in just the right order to guarantee the HW
> does not see a corrupted entry.
Agreed. I will go through the VTD specs and also the ARM driver to
determine the right order to set this up.
>
> It is a little bit tricky, but it should start with reworking
> everything to consistently use the stack to create the new entry and
> calling a centralized function to set the new entry to the live
> memory. This replace/not replace split should be purged completely.
>
> Some discussion is here
>
> https://lore.kernel.org/all/20260106142301.GS125261@ziepe.ca/
>
> It also needs to be very careful that the invalidation covers both
> the old and the new context entry concurrently while it is being
> replaced.
Yes, let me follow the discussion over there closely.
>
> For instance the placement of cache_tag_assign_domain() looks wrong to
> me; it can't be *after* the HW has been programmed to use the new tags
> :\
>
> I also didn't note where the currently active cache_tag is removed
> from the linked list during attach. Is that another bug?
You are correct, the placement of cache_tag_assign_domain is wrong.
The removal of the currently active cache tag is done in
dmar_domain_attach_device, but I will re-evaluate the placement of
both in the v2.
>
> In short, this needs a lot of work to actually properly implement
> hitless replace the way ARM can. Fortunately I think it is mostly
> mechanical and should be fairly straightforward. Refer to the ARM
> driver and try to structure vtd to have the same essential flow..
Thank you for the feedback. I will prepare a v2 series addressing these points.
>
> Jason
> Thank you for the feedback. I will prepare a v2 series addressing these points.

I think there are so many problems here you should talk to Kevin and
Baolu to come up with some plan. A single series is not going to be
able to do all of this.

Jason
On 1/12/26 06:14, Jason Gunthorpe wrote:
>> Thank you for the feedback. I will prepare a v2 series addressing these points.
> I think there are so many problems here you should talk to Kevin and
> Baolu to come up with some plan. A single series is not going to be
> able to do all of this.

Yes, absolutely. I am working on a patch series to address the
fundamental atomicity issue. I hope to post it for discussion soon.
Once that is addressed, we can discuss any additional problems.

Thanks,
baolu