[PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes

Posted by Lu Baolu 2 months, 3 weeks ago
The iotlb_sync_map iommu op allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback specifically serves two purposes:

- To flush caches when a second-stage page table is attached to a device
  whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
  resident remapping structures are visible to hardware (CAP_REG.RWBF==1).

However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() involves iterating
through all cache tags of the IOMMUs attached to the domain, protected
by a spinlock. This unnecessary execution path introduces overhead,
leading to a measurable I/O performance regression. On systems with NVMe
drives under the same bridge, throughput was observed to drop from
~6150 MiB/s down to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure that is set only when
iotlb_sync_map is required (i.e., when CM or RWBF is set), and call
cache_tag_flush_range_np() only for domains that have this flag set.
Once set, the flag is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. The immutability of this
flag therefore does not impact functionality.

Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
---
 drivers/iommu/intel/iommu.c | 19 ++++++++++++++++++-
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 148b944143b8..b23efb70b52c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  (pgd_t *)pgd, flags, old);
 }
 
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
 	return 0;
 
 out_block_translation:
@@ -3954,7 +3968,10 @@ static bool risky_device(struct pci_dev *pdev)
 static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 				      unsigned long iova, size_t size)
 {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2d1afab5eedc..61f42802fe9e 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
 	u8 has_mappings:1;		/* Has mappings configured through
 					 * iommu_map() interface.
 					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-- 
2.43.0
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  	if (ret)
>  		goto out_block_translation;
>  
> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);

This has no locking and is in the wrong order anyhow :(

Any change to how invalidation works has to be done before attaching
the HW so that the required invalidations are already happening before
the HW can walk the page table.

And you need to serialize somehow with concurrent map/unmap as iommufd
doesn't prevent userspace from racing attach with map/unmap.

The cache_tag_assign_domain() looks similarly wrong too, it needs to
start invalidating the cache tag of the new domain, then change the
context then stop invalidating the cache tag of the old
domain. Otherwise there are invalidation races.
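
Roughly, the required ordering looks like the sketch below. This is only
an illustration: cache_tag_assign_domain()/cache_tag_unassign_domain()
are the driver's existing helpers, while switch_hw_context() is a made-up
placeholder for whatever installs the new PGD into the context or PASID
entry.

static int replace_domain_sketch(struct dmar_domain *new_domain,
				 struct dmar_domain *old_domain,
				 struct device *dev, ioasid_t pasid)
{
	int ret;

	/* Start invalidating the new domain before HW can walk its PGD. */
	ret = cache_tag_assign_domain(new_domain, dev, pasid);
	if (ret)
		return ret;

	/* Switch the context/PASID entry over to the new page table. */
	ret = switch_hw_context(new_domain, dev, pasid);
	if (ret) {
		cache_tag_unassign_domain(new_domain, dev, pasid);
		return ret;
	}

	/* Only now stop invalidating the old domain's cache tags. */
	cache_tag_unassign_domain(old_domain, dev, pasid);
	return 0;
}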

Finally, if the HW needs RWBF then this also needs to do the buffer
flush in this thread before installing the context to prevent a race.

Overall this dynamic behavior may just be a bad idea, and perhaps you
can live with domains having the domain->iotlb_sync_map as a static
property set once during paging domain allocation.

If the iommu requires iotlb_sync_map but the domain does not have it
then the attach is rejected. This reduces domain sharing
possibilities, but maybe that is just fine??
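
i.e. something along these lines in the attach path, reusing the helper
from this patch (sketch only, the wrapper name is made up):

static int check_iotlb_sync_map_compat(struct dmar_domain *domain,
				       struct intel_iommu *iommu)
{
	/*
	 * Reject the attach if this iommu needs sync-on-map but the
	 * domain was not allocated with it enabled.
	 */
	if (domain_need_iotlb_sync_map(domain, iommu) &&
	    !domain->iotlb_sync_map)
		return -EINVAL;

	return 0;
}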

Jason
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Baolu Lu 2 months, 3 weeks ago
On 7/16/25 22:12, Jason Gunthorpe wrote:
> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>   	if (ret)
>>   		goto out_block_translation;
>>   
>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> 
> This has no locking and is in the wrong order anyhow :(
> 
> Any change to how invalidation works has to be done before attaching
> the HW so that the required invalidations are already happening before
> the HW can walk the page table.
> 
> And you need to serialize somehow with concurrent map/unmap as iommufd
> doesn't prevent userspace from racing attach with map/unmap.

domain->iotlb_sync_map does not change the driver's behavior. It simply
indicates that there's no need to waste time calling
cache_tag_flush_range_np(), as it's just a no-op.
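
What the flag skips is roughly the loop below (a simplified sketch of
the helper, not the literal code): when neither caching mode nor RWBF
applies, each call just takes the lock and walks every cache tag without
doing any real work.

static void cache_tag_flush_range_np_sketch(struct dmar_domain *domain,
					    unsigned long start,
					    unsigned long end)
{
	struct cache_tag *tag;
	unsigned long flags;

	spin_lock_irqsave(&domain->cache_lock, flags);
	list_for_each_entry(tag, &domain->cache_tags, node) {
		struct intel_iommu *iommu = tag->iommu;

		/* Nothing to flush unless caching mode or RWBF applies. */
		if (!cap_caching_mode(iommu->cap) && !rwbf_quirk &&
		    !cap_rwbf(iommu->cap))
			continue;

		/* ... IOTLB flush of [start, end] and/or write-buffer flush ... */
	}
	spin_unlock_irqrestore(&domain->cache_lock, flags);
}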

> 
> The cache_tag_assign_domain() looks similarly wrong too, it needs to
> start invalidating the cache tag of the new domain, then change the
> context then stop invalidating the cache tag of the old
> domain. Otherwise there are invalidation races.
> 
> Finally, if the HW needs RWBF then this also needs to do the buffer
> flush in this thread before installing the context to prevent a race.
> 
> Overall this dynamic behavior may just be a bad idea, and perhaps you
> can live with domains having the domain->iotlb_sync_map as a static
> property set once during paging domain allocation.
> 
> If the iommu requires iotlb_sync_map but the domain does not have it
> then the attach is rejected. This reduces domain sharing
> possibilities, but maybe that is just fine??

I previously discussed this with Kevin, and we agreed on a phase-by-
phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
for the driver, preventing it from looping through all cache tags to
determine if any cache invalidation work needs to be performed. We
already know it's predetermined that no work needs to be done.

RWBF is only required on some early implementations where memory
coherence was not yet implemented by the VT-d engine. It should be
difficult to find such systems in modern environments. Thus,
iotlb_sync_map is primarily relevant for nested translation that
utilizes S2 shadowing page tables. This, too, is a legacy feature, as
Intel has supported hardware-assisted nested translation for years.

Making iotlb_sync_map static is a feature, not an optimization. We are
still evaluating the value of this, as it's only truly helpful if there
are real use cases.

Thanks,
baolu
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
> On 7/16/25 22:12, Jason Gunthorpe wrote:
> > On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> > > @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
> > >   	if (ret)
> > >   		goto out_block_translation;
> > > +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> > 
> > This has no locking and is in the wrong order anyhow :(
> > 
> > Any change to how invalidation works has to be done before attaching
> > the HW so that the required invalidations are already happening before
> > the HW can walk the page table.
> > 
> > And you need to serialize somehow with concurrent map/unmap as iommufd
> > doesn't prevent userspace from racing attach with map/unmap.
> 
> domain->iotlb_sync_map does not change the driver's behavior. It simply
> indicates that there's no need to waste time calling
> cache_tag_flush_range_np(), as it's just a no-op.

Of course it changes the behavior, it changes what the invalidation
callback does.

Without locking you have a race situation where a PGD is visible to HW
that requires extra flushing and the SW is not doing the extra
flushing.

Before any PGD is made visible to the HW the software must ensure all
the required invalidations are happening.

> I previously discussed this with Kevin, and we agreed on a phase-by-
> phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
> for the driver, preventing it from looping through all cache tags to
> determine if any cache invalidation work needs to be performed. We
> already know it's predetermined that no work needs to be done.

The iteration through the cache tags is done inside a lock so it
doesn't have this race (it has the issue I mentioned setting up the
cache tag list though).

> RWBF is only required on some early implementations where memory
> coherence was not yet implemented by the VT-d engine. It should be
> difficult to find such systems in modern environments.

Then I would set it at domain creation time, check it during attach,
and remove this race.

Jason
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Baolu Lu 2 months, 3 weeks ago
On 7/17/25 19:55, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
>> On 7/16/25 22:12, Jason Gunthorpe wrote:
>>> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>>>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>>>    	if (ret)
>>>>    		goto out_block_translation;
>>>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
>>>
>>> This has no locking and is in the wrong order anyhow :(
>>>
>>> Any change to how invalidation works has to be done before attaching
>>> the HW so that the required invalidations are already happening before
>>> the HW can walk the page table.
>>>
>>> And you need to serialize somehow with concurrent map/unmap as iommufd
>>> doesn't prevent userspace from racing attach with map/unmap.
>>
>> domain->iotlb_sync_map does not change the driver's behavior. It simply
>> indicates that there's no need to waste time calling
>> cache_tag_flush_range_np(), as it's just a no-op.
> 
> Of course it changes the behavior, it changes what the invalidation
> callback does.
> 
> Without locking you have a race situation where a PGD is visible to HW
> that requires extra flushing and the SW is not doing the extra
> flushing.
> 
> Before any PGD is made visible to the HW the software must ensure all
> the required invalidations are happening.

Oh, I understand now. If there is no synchronization between attach/
detach and map/unmap operations, the cache invalidation behavior must be
determined when a domain is allocated.

> 
>> I previously discussed this with Kevin, and we agreed on a phase-by-
>> phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
>> for the driver, preventing it from looping through all cache tags to
>> determine if any cache invalidation work needs to be performed. We
>> already know it's predetermined that no work needs to be done.
> 
> The iteration through the cache tags is done inside a lock so it
> doesn't have this race (it has the issue I mentioned setting up the
> cache tag list though).
> 
>> RWBF is only required on some early implementations where memory
>> coherence was not yet implemented by the VT-d engine. It should be
>> difficult to find such systems in modern environments.
> 
> Then I would set it at domain creation time, check it during attach,
> and remove this race.

How about the following changes (compiled but not tested)?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8db8be9b7e7d..bb00dc14275d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1780,18 +1780,6 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
  					  __pa(pgd), flags, old);
  }

-static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
-				       struct intel_iommu *iommu)
-{
-	if (cap_caching_mode(iommu->cap) && intel_domain_is_ss_paging(domain))
-		return true;
-
-	if (rwbf_quirk || cap_rwbf(iommu->cap))
-		return true;
-
-	return false;
-}
-
  static int dmar_domain_attach_device(struct dmar_domain *domain,
  				     struct device *dev)
  {
@@ -1831,8 +1819,6 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
  	if (ret)
  		goto out_block_translation;

-	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
-
  	return 0;

  out_block_translation:
@@ -3352,6 +3338,14 @@ intel_iommu_domain_alloc_first_stage(struct device *dev,
  		return ERR_CAST(dmar_domain);

  	dmar_domain->domain.ops = &intel_fs_paging_domain_ops;
+	/*
+	 * iotlb sync for map is only needed for legacy implementations that
+	 * explicitly require flushing internal write buffers to ensure memory
+	 * coherence.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
  	return &dmar_domain->domain;
  }

@@ -3386,6 +3380,14 @@ intel_iommu_domain_alloc_second_stage(struct device *dev,
  	if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
  		dmar_domain->domain.dirty_ops = &intel_dirty_ops;

+	/*
+	 * Besides the internal write buffer flush, the caching mode used for
+	 * legacy nested translation (which utilizes shadowing page tables)
+	 * also requires iotlb sync on map.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
  	return &dmar_domain->domain;
  }

@@ -3446,6 +3448,11 @@ static int paging_domain_compatible_first_stage(struct dmar_domain *dmar_domain,
  	if (!cap_fl1gp_support(iommu->cap) &&
  	    (dmar_domain->domain.pgsize_bitmap & SZ_1G))
  		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap)) && !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
  	return 0;
  }

@@ -3469,6 +3476,12 @@ paging_domain_compatible_second_stage(struct dmar_domain *dmar_domain,
  		return -EINVAL;
  	if (!(sslps & BIT(1)) && (dmar_domain->domain.pgsize_bitmap & SZ_1G))
  		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap)) &&
+	    !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
  	return 0;
  }

-- 
2.43.0

Thanks,
baolu
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Jason Gunthorpe 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:

> How about the following changes (compiled but not tested)?

Yes, exactly what I was thinking!

That leaves the cache tag linked list race..

Nicolin is working on a neat solution for SMMU I wonder if we can
reuse it here too..

Jason
Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
Posted by Baolu Lu 2 months, 2 weeks ago
On 7/18/25 21:29, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:
> 
>> How about the following changes (compiled but not tested)?
> Yes, exactly what I was thinking!

I will submit a formal patch.

> 
> That leaves the cache tag linked list race..
> 
> Nicolin is working on a neat solution for SMMU I wonder if we can
> reuse it here too..

Okay, let's wait and see.

Thanks,
baolu