The iotlb_sync_map iommu op allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback serves two specific purposes:
- To flush caches when a second-stage page table is attached to a device
whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.
Even when it is effectively a no-op, cache_tag_flush_range_np() still
iterates through all cache tags of the IOMMUs attached to the domain
while holding a spinlock. This unnecessary execution path introduces
overhead, leading to a measurable I/O performance regression. On systems
with NVMe devices under the same bridge, performance was observed to
drop from ~6150 MiB/s to ~4985 MiB/s.
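For reference, a simplified sketch of that path is shown below. This is
an illustration only, not the exact upstream code; the list, lock and
capability helpers follow the driver's naming, but the real helper does
more work per tag than shown here:

  struct cache_tag *tag;
  unsigned long flags;

  spin_lock_irqsave(&domain->cache_lock, flags);
  list_for_each_entry(tag, &domain->cache_tags, node) {
          struct intel_iommu *iommu = tag->iommu;

          /* Nothing to do when neither caching mode nor RWBF applies. */
          if (!cap_caching_mode(iommu->cap) && !rwbf_quirk &&
              !cap_rwbf(iommu->cap))
                  continue;

          /* ... otherwise flush the IOTLB or the write buffer ... */
  }
  spin_unlock_irqrestore(&domain->cache_lock, flags);

The lock acquisition and list walk happen even when every iteration
takes the early "continue", which is exactly the overhead being avoided.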
Introduce a flag in the dmar_domain structure that is set only when
iotlb_sync_map is actually required (i.e., when CM or RWBF is set).
cache_tag_flush_range_np() is then called only for domains with this
flag set. Once set, the flag is never cleared, given that there won't be
mixed configurations in real-world scenarios where some IOMMUs in a
system operate in caching mode while others do not. In theory, the
immutability of this flag does not impact functionality.
Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
---
drivers/iommu/intel/iommu.c | 19 ++++++++++++++++++-
drivers/iommu/intel/iommu.h | 3 +++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 148b944143b8..b23efb70b52c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
(pgd_t *)pgd, flags, old);
}
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+ struct intel_iommu *iommu)
+{
+ if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+ return true;
+
+ if (rwbf_quirk || cap_rwbf(iommu->cap))
+ return true;
+
+ return false;
+}
+
static int dmar_domain_attach_device(struct dmar_domain *domain,
struct device *dev)
{
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
if (ret)
goto out_block_translation;
+ domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
return 0;
out_block_translation:
@@ -3954,7 +3968,10 @@ static bool risky_device(struct pci_dev *pdev)
static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
unsigned long iova, size_t size)
{
- cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+ if (dmar_domain->iotlb_sync_map)
+ cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
return 0;
}
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2d1afab5eedc..61f42802fe9e 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
u8 has_mappings:1; /* Has mappings configured through
* iommu_map() interface.
*/
+ u8 iotlb_sync_map:1; /* Need to flush IOTLB cache or write
+ * buffer when creating mappings.
+ */
spinlock_t lock; /* Protect device tracking lists */
struct list_head devices; /* all devices' list */
--
2.43.0
On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  	if (ret)
>  		goto out_block_translation;
>
> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);

This has no locking and is in the wrong order anyhow :(

Any change to how invalidation works has to be done before attaching
the HW so that the required invalidations are already happening before
the HW can walk the page table.

And you need to serialize somehow with concurrent map/unmap as iommufd
doesn't prevent userspace from racing attach with map/unmap.

The cache_tag_assign_domain() looks similarly wrong too, it needs to
start invalidating the cache tag of the new domain, then change the
context then stop invalidating the cache tag of the old domain.
Otherwise there are invalidation races.

Finally, if the HW needs RWBF then this also needs to do the buffer
flush in this thread before installing the context to prevent a race.

Overall this dynamic behavior may just be a bad idea, and perhaps you
can live with domains having the domain->iotlb_sync_map as a static
property set once during paging domain allocation.

If the iommu requires iotlb_sync_map but the domain does not have it
then the attach is rejected. This reduces domain sharing
possibilities, but maybe that is just fine??

Jason
On 7/16/25 22:12, Jason Gunthorpe wrote:
> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>  	if (ret)
>>  		goto out_block_translation;
>>
>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
>
> This has no locking and is in the wrong order anyhow :(
>
> Any change to how invalidation works has to be done before attaching
> the HW so that the required invalidations are already happening before
> the HW can walk the page table.
>
> And you need to serialize somehow with concurrent map/unmap as iommufd
> doesn't prevent userspace from racing attach with map/unmap.

domain->iotlb_sync_map does not change the driver's behavior. It simply
indicates that there's no need to waste time calling
cache_tag_flush_range_np(), as it's just a no-op.

>
> The cache_tag_assign_domain() looks similarly wrong too, it needs to
> start invalidating the cache tag of the new domain, then change the
> context then stop invalidating the cache tag of the old
> domain. Otherwise there are invalidation races.
>
> Finally, if the HW needs RWBF then this also needs to do the buffer
> flush in this thread before installing the context to prevent a race.
>
> Overall this dynamic behavior may just be a bad idea, and perhaps you
> can live with domains having the domain->iotlb_sync_map as a static
> property set once during paging domain allocation.
>
> If the iommu requires iotlb_sync_map but the domain does not have it
> then the attach is rejected. This reduces domain sharing
> possibilities, but maybe that is just fine??

I previously discussed this with Kevin, and we agreed on a
phase-by-phase approach. As I mentioned, domain->iotlb_sync_map is
merely a hint for the driver, preventing it from looping through all
cache tags to determine if any cache invalidation work needs to be
performed. We already know it's predetermined that no work needs to be
done.

RWBF is only required on some early implementations where memory
coherence was not yet implemented by the VT-d engine. It should be
difficult to find such systems in modern environments.

Thus, iotlb_sync_map is primarily relevant for nested translation that
utilizes S2 shadowing page tables. This, too, is a legacy feature, as
Intel has supported hardware-assisted nested translation for years.

Making iotlb_sync_map static is a feature, not an optimization. We are
still evaluating the value of this, as it's only truly helpful if there
are real use cases.

Thanks,
baolu
On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
> On 7/16/25 22:12, Jason Gunthorpe wrote:
> > On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> > > @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
> > >  	if (ret)
> > >  		goto out_block_translation;
> > > +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> >
> > This has no locking and is in the wrong order anyhow :(
> >
> > Any change to how invalidation works has to be done before attaching
> > the HW so that the required invalidations are already happening before
> > the HW can walk the page table.
> >
> > And you need to serialize somehow with concurrent map/unmap as iommufd
> > doesn't prevent userspace from racing attach with map/unmap.
>
> domain->iotlb_sync_map does not change the driver's behavior. It simply
> indicates that there's no need to waste time calling
> cache_tag_flush_range_np(), as it's just a no-op.

Of course it changes the behavior, it changes what the invalidation
callback does.

Without locking you have a race situation where a PGD is visible to HW
that requires extra flushing and the SW is not doing the extra
flushing.

Before any PGD is made visible to the HW the software must ensure all
the required invalidations are happening.

> I previously discussed this with Kevin, and we agreed on a
> phase-by-phase approach. As I mentioned, domain->iotlb_sync_map is
> merely a hint for the driver, preventing it from looping through all
> cache tags to determine if any cache invalidation work needs to be
> performed. We already know it's predetermined that no work needs to
> be done.

The iteration through the cache tags is done inside a lock so it
doesn't have this race (it has the issue I mentioned setting up the
cache tag list though).

> RWBF is only required on some early implementations where memory
> coherence was not yet implemented by the VT-d engine. It should be
> difficult to find such systems in modern environments.

Then I would set it at domain creation time, check it during attach,
and remove this race.

Jason
On 7/17/25 19:55, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
>> On 7/16/25 22:12, Jason Gunthorpe wrote:
>>> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>>>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>>>  	if (ret)
>>>>  		goto out_block_translation;
>>>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
>>>
>>> This has no locking and is in the wrong order anyhow :(
>>>
>>> Any change to how invalidation works has to be done before attaching
>>> the HW so that the required invalidations are already happening before
>>> the HW can walk the page table.
>>>
>>> And you need to serialize somehow with concurrent map/unmap as iommufd
>>> doesn't prevent userspace from racing attach with map/unmap.
>>
>> domain->iotlb_sync_map does not change the driver's behavior. It simply
>> indicates that there's no need to waste time calling
>> cache_tag_flush_range_np(), as it's just a no-op.
>
> Of course it changes the behavior, it changes what the invalidation
> callback does.
>
> Without locking you have a race situation where a PGD is visible to HW
> that requires extra flushing and the SW is not doing the extra
> flushing.
>
> Before any PGD is made visible to the HW the software must ensure all
> the required invalidations are happening.

Oh, I understand now. If there is no synchronization between
attach/detach and map/unmap operations, the cache invalidation behavior
must be determined when a domain is allocated.

>
>> I previously discussed this with Kevin, and we agreed on a
>> phase-by-phase approach. As I mentioned, domain->iotlb_sync_map is
>> merely a hint for the driver, preventing it from looping through all
>> cache tags to determine if any cache invalidation work needs to be
>> performed. We already know it's predetermined that no work needs to
>> be done.
>
> The iteration through the cache tags is done inside a lock so it
> doesn't have this race (it has the issue I mentioned setting up the
> cache tag list though).
>
>> RWBF is only required on some early implementations where memory
>> coherence was not yet implemented by the VT-d engine. It should be
>> difficult to find such systems in modern environments.
>
> Then I would set it at domain creation time, check it during attach,
> and remove this race.

How about the following changes (compiled but not tested)?
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8db8be9b7e7d..bb00dc14275d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1780,18 +1780,6 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  __pa(pgd), flags, old);
 }
 
-static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
-				       struct intel_iommu *iommu)
-{
-	if (cap_caching_mode(iommu->cap) && intel_domain_is_ss_paging(domain))
-		return true;
-
-	if (rwbf_quirk || cap_rwbf(iommu->cap))
-		return true;
-
-	return false;
-}
-
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1831,8 +1819,6 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
-	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
-
 	return 0;
 
 out_block_translation:
@@ -3352,6 +3338,14 @@ intel_iommu_domain_alloc_first_stage(struct device *dev,
 		return ERR_CAST(dmar_domain);
 	dmar_domain->domain.ops = &intel_fs_paging_domain_ops;
 
+	/*
+	 * iotlb sync for map is only needed for legacy implementations that
+	 * explicitly require flushing internal write buffers to ensure memory
+	 * coherence.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
 	return &dmar_domain->domain;
 }
 
@@ -3386,6 +3380,14 @@ intel_iommu_domain_alloc_second_stage(struct device *dev,
 	if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
 		dmar_domain->domain.dirty_ops = &intel_dirty_ops;
 
+	/*
+	 * Besides the internal write buffer flush, the caching mode used for
+	 * legacy nested translation (which utilizes shadowing page tables)
+	 * also requires iotlb sync on map.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
 	return &dmar_domain->domain;
 }
 
@@ -3446,6 +3448,11 @@ static int paging_domain_compatible_first_stage(struct dmar_domain *dmar_domain,
 	if (!cap_fl1gp_support(iommu->cap) &&
 	    (dmar_domain->domain.pgsize_bitmap & SZ_1G))
 		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap)) && !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3469,6 +3476,12 @@ paging_domain_compatible_second_stage(struct dmar_domain *dmar_domain,
 		return -EINVAL;
 	if (!(sslps & BIT(1)) && (dmar_domain->domain.pgsize_bitmap & SZ_1G))
 		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap)) &&
+	    !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
 	return 0;
 }
-- 
2.43.0

Thanks,
baolu
On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:
> How about the following changes (compiled but not tested)?

Yes, exactly what I was thinking!

That leaves the cache tag linked list race.. Nicolin is working on a
neat solution for SMMU I wonder if we can reuse it here too..

Jason
On 7/18/25 21:29, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:
>
>> How about the following changes (compiled but not tested)?
> Yes, exactly what I was thinking!

I will submit a formal patch.

>
> That leaves the cache tag linked list race..
>
> Nicolin is working on a neat solution for SMMU I wonder if we can
> reuse it here too..

Okay, let's wait and see.

Thanks,
baolu