[PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry

Lu Baolu posted 3 patches 3 weeks, 4 days ago
There is a newer version of this series
[PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Lu Baolu 3 weeks, 4 days ago
The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
bytes). When tearing down an entry, the current implementation zeros the
entire 64-byte structure immediately.

However, the IOMMU hardware may fetch these 64 bytes using multiple
internal transactions (e.g., four 128-bit bursts). If a hardware fetch
occurs simultaneously with the CPU zeroing the entry, the hardware could
observe a "torn" entry — where some chunks are zeroed and others still
contain old data — leading to unpredictable behavior or spurious faults.

Follow the "Guidance to Software for Invalidations" in the VT-d spec
(Section 6.5.3.3) by implementing a proper ownership handshake:

1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
   hardware that the entry is no longer valid.
2. Execute the required invalidation sequence (PASID cache, IOTLB, and
   Device-TLB flush) to ensure the hardware has released all cached
   references to the entry.
3. Only after the flushes are complete, zero out the remaining fields of
   the PASID entry.

Additionally, add an explicit clflush in intel_pasid_clear_entry() to
ensure that the cleared entry is visible to the IOMMU on systems where
memory coherency (ecap_coherent) is not supported.

Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/pasid.h | 12 ++++++++++++
 drivers/iommu/intel/pasid.c |  9 +++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..35de1d77355f 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
 	pasid_set_bits(&pe->val[0], 1 << 0, 1);
 }
 
+/*
+ * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
+ * This initiates the transition of the entry's ownership from hardware
+ * to software. The caller is responsible for fulfilling the invalidation
+ * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
+ * Software for Invalidations).
+ */
+static inline void pasid_clear_present(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 0, 0);
+}
+
 /*
  * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
  * entry.
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 298a39183996..4f36138448d8 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
  * Interfaces for PASID table entry manipulation:
  */
 static void
-intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
+intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
+			u32 pasid, bool fault_ignore)
 {
 	struct pasid_entry *pe;
 
@@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
 		pasid_clear_entry_with_fpd(pe);
 	else
 		pasid_clear_entry(pe);
+
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pe, sizeof(*pe));
 }
 
 static void
@@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
 
 	did = pasid_get_domain_id(pte);
 	pgtt = pasid_pte_get_pgtt(pte);
-	intel_pasid_clear_entry(dev, pasid, fault_ignore);
+	pasid_clear_present(pte);
 	spin_unlock(&iommu->lock);
 
 	if (!ecap_coherent(iommu->ecap))
@@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
 
 	devtlb_invalidation_with_pasid(iommu, dev, pasid);
+	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
 	if (!fault_ignore)
 		intel_iommu_drain_pasid_prq(dev, pasid);
 }
-- 
2.43.0

RE: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Tian, Kevin 3 weeks, 3 days ago
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Tuesday, January 13, 2026 11:01 AM
> 
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
> 
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
> 
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
> 
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>    hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>    Device-TLB flush) to ensure the hardware has released all cached
>    references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
>    the PASID entry.
> 
> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> ensure that the cleared entry is visible to the IOMMU on systems where
> memory coherency (ecap_coherent) is not supported.
> 

better split the clflush part into a separate patch
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Baolu Lu 3 weeks, 3 days ago
On 1/14/26 15:32, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Tuesday, January 13, 2026 11:01 AM
>>
>> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
>> bytes). When tearing down an entry, the current implementation zeros the
>> entire 64-byte structure immediately.
>>
>> However, the IOMMU hardware may fetch these 64 bytes using multiple
>> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
>> occurs simultaneously with the CPU zeroing the entry, the hardware could
>> observe a "torn" entry — where some chunks are zeroed and others still
>> contain old data — leading to unpredictable behavior or spurious faults.
>>
>> Follow the "Guidance to Software for Invalidations" in the VT-d spec
>> (Section 6.5.3.3) by implementing a proper ownership handshake:
>>
>> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>>     hardware that the entry is no longer valid.
>> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>>     Device-TLB flush) to ensure the hardware has released all cached
>>     references to the entry.
>> 3. Only after the flushes are complete, zero out the remaining fields of
>>     the PASID entry.
>>
>> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
>> ensure that the cleared entry is visible to the IOMMU on systems where
>> memory coherency (ecap_coherent) is not supported.
>>
> 
> better split the clflush part into a separate patch

Ah, the commit message is a bit confusing. Previously, 
intel_pasid_clear_entry() was followed by a clflush. Now that we've 
moved the clearing logic after the cache invalidation, an explicit 
clflush is required. Instead of splitting this into a separate patch, 
how about amending the commit message to make this clear.

Or, keep intel_pasid_clear_entry() untouched like the following:

iommu/vt-d: Clear Present bit before tearing down PASID entry

The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
bytes). When tearing down an entry, the current implementation zeros the
entire 64-byte structure immediately.

However, the IOMMU hardware may fetch these 64 bytes using multiple
internal transactions (e.g., four 128-bit bursts). If a hardware fetch
occurs simultaneously with the CPU zeroing the entry, the hardware could
observe a "torn" entry — where some chunks are zeroed and others still
contain old data — leading to unpredictable behavior or spurious faults.

Follow the "Guidance to Software for Invalidations" in the VT-d spec
(Section 6.5.3.3) by implementing a proper ownership handshake:

1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
    hardware that the entry is no longer valid.
2. Execute the required invalidation sequence (PASID cache, IOTLB, and
    Device-TLB flush) to ensure the hardware has released all cached
    references to the entry.
3. Only after the flushes are complete, zero out the remaining fields of
    the PASID entry.

Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
  drivers/iommu/intel/pasid.h | 12 ++++++++++++
  drivers/iommu/intel/pasid.c |  6 +++++-
  2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index b4c85242dc79..35de1d77355f 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -237,6 +237,18 @@ static inline void pasid_set_present(struct 
pasid_entry *pe)
  	pasid_set_bits(&pe->val[0], 1 << 0, 1);
  }

+/*
+ * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
+ * This initiates the transition of the entry's ownership from hardware
+ * to software. The caller is responsible for fulfilling the invalidation
+ * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
+ * Software for Invalidations).
+ */
+static inline void pasid_clear_present(struct pasid_entry *pe)
+{
+	pasid_set_bits(&pe->val[0], 1 << 0, 0);
+}
+
  /*
   * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
   * entry.
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 298a39183996..3a8ec5eba83e 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -272,7 +272,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu 
*iommu, struct device *dev,

  	did = pasid_get_domain_id(pte);
  	pgtt = pasid_pte_get_pgtt(pte);
-	intel_pasid_clear_entry(dev, pasid, fault_ignore);
+	pasid_clear_present(pte);
  	spin_unlock(&iommu->lock);

  	if (!ecap_coherent(iommu->ecap))
@@ -286,6 +286,10 @@ void intel_pasid_tear_down_entry(struct intel_iommu 
*iommu, struct device *dev,
  		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);

  	devtlb_invalidation_with_pasid(iommu, dev, pasid);
+	intel_pasid_clear_entry(dev, pasid, fault_ignore);
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(pte, sizeof(*pte));
+
  	if (!fault_ignore)
  		intel_iommu_drain_pasid_prq(dev, pasid);
  }
-- 
2.43.0

Thanks,
baolu
RE: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Tian, Kevin 3 weeks, 2 days ago
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, January 14, 2026 4:27 PM
> 
> On 1/14/26 15:32, Tian, Kevin wrote:
> >> From: Lu Baolu <baolu.lu@linux.intel.com>
> >> Sent: Tuesday, January 13, 2026 11:01 AM
> >>
> >> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> >> bytes). When tearing down an entry, the current implementation zeros
> the
> >> entire 64-byte structure immediately.
> >>
> >> However, the IOMMU hardware may fetch these 64 bytes using multiple
> >> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> >> occurs simultaneously with the CPU zeroing the entry, the hardware
> could
> >> observe a "torn" entry — where some chunks are zeroed and others still
> >> contain old data — leading to unpredictable behavior or spurious faults.
> >>
> >> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> >> (Section 6.5.3.3) by implementing a proper ownership handshake:
> >>
> >> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
> >>     hardware that the entry is no longer valid.
> >> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
> >>     Device-TLB flush) to ensure the hardware has released all cached
> >>     references to the entry.
> >> 3. Only after the flushes are complete, zero out the remaining fields of
> >>     the PASID entry.
> >>
> >> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> >> ensure that the cleared entry is visible to the IOMMU on systems where
> >> memory coherency (ecap_coherent) is not supported.
> >>
> >
> > better split the clflush part into a separate patch
> 
> Ah, the commit message is a bit confusing. Previously,
> intel_pasid_clear_entry() was followed by a clflush. Now that we've
> moved the clearing logic after the cache invalidation, an explicit
> clflush is required. Instead of splitting this into a separate patch,
> how about amending the commit message to make this clear.
> 
> Or, keep intel_pasid_clear_entry() untouched like the following:

let's go this way to have two places consistent.

> 
> iommu/vt-d: Clear Present bit before tearing down PASID entry
> 
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
> 
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
> 
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
> 
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>     hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>     Device-TLB flush) to ensure the hardware has released all cached
>     references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
>     the PASID entry.
> 
> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>   drivers/iommu/intel/pasid.h | 12 ++++++++++++
>   drivers/iommu/intel/pasid.c |  6 +++++-
>   2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index b4c85242dc79..35de1d77355f 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct
> pasid_entry *pe)
>   	pasid_set_bits(&pe->val[0], 1 << 0, 1);
>   }
> 
> +/*
> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
> + * This initiates the transition of the entry's ownership from hardware
> + * to software. The caller is responsible for fulfilling the invalidation
> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
> + * Software for Invalidations).
> + */
> +static inline void pasid_clear_present(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 0, 0);
> +}
> +
>   /*
>    * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
>    * entry.
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 298a39183996..3a8ec5eba83e 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -272,7 +272,7 @@ void intel_pasid_tear_down_entry(struct
> intel_iommu
> *iommu, struct device *dev,
> 
>   	did = pasid_get_domain_id(pte);
>   	pgtt = pasid_pte_get_pgtt(pte);
> -	intel_pasid_clear_entry(dev, pasid, fault_ignore);
> +	pasid_clear_present(pte);
>   	spin_unlock(&iommu->lock);
> 
>   	if (!ecap_coherent(iommu->ecap))
> @@ -286,6 +286,10 @@ void intel_pasid_tear_down_entry(struct
> intel_iommu
> *iommu, struct device *dev,
>   		iommu->flush.flush_iotlb(iommu, did, 0, 0,
> DMA_TLB_DSI_FLUSH);
> 
>   	devtlb_invalidation_with_pasid(iommu, dev, pasid);
> +	intel_pasid_clear_entry(dev, pasid, fault_ignore);
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pte, sizeof(*pte));
> +
>   	if (!fault_ignore)
>   		intel_iommu_drain_pasid_prq(dev, pasid);
>   }
> --
> 2.43.0
> 
> Thanks,
> baolu
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Dmytro Maluka 3 weeks, 3 days ago
On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
> bytes). When tearing down an entry, the current implementation zeros the
> entire 64-byte structure immediately.
> 
> However, the IOMMU hardware may fetch these 64 bytes using multiple
> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
> occurs simultaneously with the CPU zeroing the entry, the hardware could
> observe a "torn" entry — where some chunks are zeroed and others still
> contain old data — leading to unpredictable behavior or spurious faults.
> 
> Follow the "Guidance to Software for Invalidations" in the VT-d spec
> (Section 6.5.3.3) by implementing a proper ownership handshake:
> 
> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>    hardware that the entry is no longer valid.
> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>    Device-TLB flush) to ensure the hardware has released all cached
>    references to the entry.
> 3. Only after the flushes are complete, zero out the remaining fields of
>    the PASID entry.
> 
> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
> ensure that the cleared entry is visible to the IOMMU on systems where
> memory coherency (ecap_coherent) is not supported.
> 
> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  drivers/iommu/intel/pasid.h | 12 ++++++++++++
>  drivers/iommu/intel/pasid.c |  9 +++++++--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index b4c85242dc79..35de1d77355f 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
>  	pasid_set_bits(&pe->val[0], 1 << 0, 1);
>  }
>  
> +/*
> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
> + * This initiates the transition of the entry's ownership from hardware
> + * to software. The caller is responsible for fulfilling the invalidation
> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
> + * Software for Invalidations).
> + */
> +static inline void pasid_clear_present(struct pasid_entry *pe)
> +{
> +	pasid_set_bits(&pe->val[0], 1 << 0, 0);
> +}
> +
>  /*
>   * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
>   * entry.
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 298a39183996..4f36138448d8 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
>   * Interfaces for PASID table entry manipulation:
>   */
>  static void
> -intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
> +intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
> +			u32 pasid, bool fault_ignore)
>  {
>  	struct pasid_entry *pe;
>  
> @@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
>  		pasid_clear_entry_with_fpd(pe);
>  	else
>  		pasid_clear_entry(pe);
> +
> +	if (!ecap_coherent(iommu->ecap))
> +		clflush_cache_range(pe, sizeof(*pe));
>  }
>  
>  static void
> @@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>  
>  	did = pasid_get_domain_id(pte);
>  	pgtt = pasid_pte_get_pgtt(pte);
> -	intel_pasid_clear_entry(dev, pasid, fault_ignore);
> +	pasid_clear_present(pte);
>  	spin_unlock(&iommu->lock);
>  
>  	if (!ecap_coherent(iommu->ecap))
> @@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>  		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>  
>  	devtlb_invalidation_with_pasid(iommu, dev, pasid);
> +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);

Is it safe to do this with iommu->lock already unlocked?

>  	if (!fault_ignore)
>  		intel_iommu_drain_pasid_prq(dev, pasid);
>  }
> -- 
> 2.43.0
> 
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Baolu Lu 3 weeks, 3 days ago
On 1/14/26 03:34, Dmytro Maluka wrote:
> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>> The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64
>> bytes). When tearing down an entry, the current implementation zeros the
>> entire 64-byte structure immediately.
>>
>> However, the IOMMU hardware may fetch these 64 bytes using multiple
>> internal transactions (e.g., four 128-bit bursts). If a hardware fetch
>> occurs simultaneously with the CPU zeroing the entry, the hardware could
>> observe a "torn" entry — where some chunks are zeroed and others still
>> contain old data — leading to unpredictable behavior or spurious faults.
>>
>> Follow the "Guidance to Software for Invalidations" in the VT-d spec
>> (Section 6.5.3.3) by implementing a proper ownership handshake:
>>
>> 1. Clear only the 'Present' (P) bit of the PASID entry. This tells the
>>     hardware that the entry is no longer valid.
>> 2. Execute the required invalidation sequence (PASID cache, IOTLB, and
>>     Device-TLB flush) to ensure the hardware has released all cached
>>     references to the entry.
>> 3. Only after the flushes are complete, zero out the remaining fields of
>>     the PASID entry.
>>
>> Additionally, add an explicit clflush in intel_pasid_clear_entry() to
>> ensure that the cleared entry is visible to the IOMMU on systems where
>> memory coherency (ecap_coherent) is not supported.
>>
>> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables")
>> Signed-off-by: Lu Baolu<baolu.lu@linux.intel.com>
>> ---
>>   drivers/iommu/intel/pasid.h | 12 ++++++++++++
>>   drivers/iommu/intel/pasid.c |  9 +++++++--
>>   2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
>> index b4c85242dc79..35de1d77355f 100644
>> --- a/drivers/iommu/intel/pasid.h
>> +++ b/drivers/iommu/intel/pasid.h
>> @@ -237,6 +237,18 @@ static inline void pasid_set_present(struct pasid_entry *pe)
>>   	pasid_set_bits(&pe->val[0], 1 << 0, 1);
>>   }
>>   
>> +/*
>> + * Clear the Present (P) bit (bit 0) of a scalable-mode PASID table entry.
>> + * This initiates the transition of the entry's ownership from hardware
>> + * to software. The caller is responsible for fulfilling the invalidation
>> + * handshake recommended by the VT-d spec, Section 6.5.3.3 (Guidance to
>> + * Software for Invalidations).
>> + */
>> +static inline void pasid_clear_present(struct pasid_entry *pe)
>> +{
>> +	pasid_set_bits(&pe->val[0], 1 << 0, 0);
>> +}
>> +
>>   /*
>>    * Setup Page Walk Snoop bit (Bit 87) of a scalable mode PASID
>>    * entry.
>> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
>> index 298a39183996..4f36138448d8 100644
>> --- a/drivers/iommu/intel/pasid.c
>> +++ b/drivers/iommu/intel/pasid.c
>> @@ -178,7 +178,8 @@ static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
>>    * Interfaces for PASID table entry manipulation:
>>    */
>>   static void
>> -intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
>> +intel_pasid_clear_entry(struct intel_iommu *iommu, struct device *dev,
>> +			u32 pasid, bool fault_ignore)
>>   {
>>   	struct pasid_entry *pe;
>>   
>> @@ -190,6 +191,9 @@ intel_pasid_clear_entry(struct device *dev, u32 pasid, bool fault_ignore)
>>   		pasid_clear_entry_with_fpd(pe);
>>   	else
>>   		pasid_clear_entry(pe);
>> +
>> +	if (!ecap_coherent(iommu->ecap))
>> +		clflush_cache_range(pe, sizeof(*pe));
>>   }
>>   
>>   static void
>> @@ -272,7 +276,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>>   
>>   	did = pasid_get_domain_id(pte);
>>   	pgtt = pasid_pte_get_pgtt(pte);
>> -	intel_pasid_clear_entry(dev, pasid, fault_ignore);
>> +	pasid_clear_present(pte);
>>   	spin_unlock(&iommu->lock);
>>   
>>   	if (!ecap_coherent(iommu->ecap))
>> @@ -286,6 +290,7 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
>>   		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>>   
>>   	devtlb_invalidation_with_pasid(iommu, dev, pasid);
>> +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> Is it safe to do this with iommu->lock already unlocked?

Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
 >mutex in the iommu core, which ensures that no other thread can attempt
to allocate or setup this same PASID until intel_pasid_tear_down_entry()
has returned.

The iommu->lock is held during the initial transition (P->0) to ensure
atomicity against other hardware-table walkers, but once the P bit is
cleared and the caches are flushed, the final zeroing of the 'dead'
entry does not strictly require the spinlock because the PASID remains
reserved in software until the function completes.

Thanks,
baolu
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Dmytro Maluka 3 weeks, 3 days ago
On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> On 1/14/26 03:34, Dmytro Maluka wrote:
> > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > Is it safe to do this with iommu->lock already unlocked?
> 
> Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
> >mutex in the iommu core, which ensures that no other thread can attempt
> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> has returned.
> 
> The iommu->lock is held during the initial transition (P->0) to ensure
> atomicity against other hardware-table walkers, but once the P bit is
> cleared and the caches are flushed, the final zeroing of the 'dead'
> entry does not strictly require the spinlock because the PASID remains
> reserved in software until the function completes.

Ok. Just to understand: "other hardware-table walkers" means some
software walkers, not hardware ones? Which software walkers are those?
(I can't imagine how holding a spinlock could prevent the hardware from
walking those tables. :))
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Baolu Lu 3 weeks, 2 days ago
On 1/14/26 19:12, Dmytro Maluka wrote:
> On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
>> On 1/14/26 03:34, Dmytro Maluka wrote:
>>> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>>>> +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
>>> Is it safe to do this with iommu->lock already unlocked?
>>
>> Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
>>> mutex in the iommu core, which ensures that no other thread can attempt
>> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
>> has returned.
>>
>> The iommu->lock is held during the initial transition (P->0) to ensure
>> atomicity against other hardware-table walkers, but once the P bit is
>> cleared and the caches are flushed, the final zeroing of the 'dead'
>> entry does not strictly require the spinlock because the PASID remains
>> reserved in software until the function completes.
> 
> Ok. Just to understand: "other hardware-table walkers" means some
> software walkers, not hardware ones? Which software walkers are those?
> (I can't imagine how holding a spinlock could prevent the hardware from
> walking those tables. :))

You are right. A spinlock doesn't stop the hardware. The spinlock
serializes software threads to ensure the hardware walker always sees a
consistent entry.

When a PASID entry is active (P=1), other kernel paths might modify
the control bits in-place. For example:

void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
                                           struct device *dev, u32 pasid)
{
         struct pasid_entry *pte;
         u16 did;

         spin_lock(&iommu->lock);
         pte = intel_pasid_get_entry(dev, pasid);
         if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
                 spin_unlock(&iommu->lock);
                 return;
         }

         pasid_set_pgsnp(pte);
         did = pasid_get_domain_id(pte);
         spin_unlock(&iommu->lock);

         intel_pasid_flush_present(iommu, dev, pasid, did, pte);
}

In this case, the iommu->lock ensures that if two threads try to modify
the same active entry, they don't interfere with each other and leave
the entry in a 'torn' state for the IOMMU hardware to read.

In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
(setting P=0 and flushing caches), the entry is owned exclusively  by
the teardown thread until it is re-configured. That's the reason why the
final zeroing doesn't need the spinlock.

Thanks,
baolu
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Dmytro Maluka 3 weeks, 1 day ago
On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
> On 1/14/26 19:12, Dmytro Maluka wrote:
> > On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> > > On 1/14/26 03:34, Dmytro Maluka wrote:
> > > > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > > > +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > > > Is it safe to do this with iommu->lock already unlocked?
> > > 
> > > Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
> > > > mutex in the iommu core, which ensures that no other thread can attempt
> > > to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> > > has returned.
> > > 
> > > The iommu->lock is held during the initial transition (P->0) to ensure
> > > atomicity against other hardware-table walkers, but once the P bit is
> > > cleared and the caches are flushed, the final zeroing of the 'dead'
> > > entry does not strictly require the spinlock because the PASID remains
> > > reserved in software until the function completes.
> > 
> > Ok. Just to understand: "other hardware-table walkers" means some
> > software walkers, not hardware ones? Which software walkers are those?
> > (I can't imagine how holding a spinlock could prevent the hardware from
> > walking those tables. :))
> 
> You are right. A spinlock doesn't stop the hardware. The spinlock
> serializes software threads to ensure the hardware walker always sees a
> consistent entry.
> 
> When a PASID entry is active (P=1), other kernel paths might modify
> the control bits in-place. For example:
> 
> void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
>                                           struct device *dev, u32 pasid)
> {
>         struct pasid_entry *pte;
>         u16 did;
> 
>         spin_lock(&iommu->lock);
>         pte = intel_pasid_get_entry(dev, pasid);
>         if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
>                 spin_unlock(&iommu->lock);
>                 return;
>         }
> 
>         pasid_set_pgsnp(pte);
>         did = pasid_get_domain_id(pte);
>         spin_unlock(&iommu->lock);
> 
>         intel_pasid_flush_present(iommu, dev, pasid, did, pte);
> }
> 
> In this case, the iommu->lock ensures that if two threads try to modify
> the same active entry, they don't interfere with each other and leave
> the entry in a 'torn' state for the IOMMU hardware to read.
> 
> In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
> (setting P=0 and flushing caches), the entry is owned exclusively  by
> the teardown thread until it is re-configured. That's the reason why the
> final zeroing doesn't need the spinlock.

I see. Am I correct that those other code paths (modifying an entry
in-place) are not supposed to do that concurrently with
intel_pasid_tear_down_entry(), i.e. they should only do that while it is
guaranteed that the entry remains present? Otherwise there is a bug
(hence, for example, the WARN_ON in
intel_pasid_setup_page_snoop_control())? So, holding iommu->lock during
entry teardown is not strictly necessary (i.e. we could unlock it even
earlier than setting P=0), i.e. holding the lock until the entry is
deactivated is basically just a safety measure for possible buggy code?
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Baolu Lu 3 weeks, 1 day ago
On 1/16/26 05:35, Dmytro Maluka wrote:
> On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
>> On 1/14/26 19:12, Dmytro Maluka wrote:
>>> On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
>>>> On 1/14/26 03:34, Dmytro Maluka wrote:
>>>>> On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
>>>>>> +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
>>>>> Is it safe to do this with iommu->lock already unlocked?
>>>>
>>>> Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
>>>>> mutex in the iommu core, which ensures that no other thread can attempt
>>>> to allocate or setup this same PASID until intel_pasid_tear_down_entry()
>>>> has returned.
>>>>
>>>> The iommu->lock is held during the initial transition (P->0) to ensure
>>>> atomicity against other hardware-table walkers, but once the P bit is
>>>> cleared and the caches are flushed, the final zeroing of the 'dead'
>>>> entry does not strictly require the spinlock because the PASID remains
>>>> reserved in software until the function completes.
>>>
>>> Ok. Just to understand: "other hardware-table walkers" means some
>>> software walkers, not hardware ones? Which software walkers are those?
>>> (I can't imagine how holding a spinlock could prevent the hardware from
>>> walking those tables. :))
>>
>> You are right. A spinlock doesn't stop the hardware. The spinlock
>> serializes software threads to ensure the hardware walker always sees a
>> consistent entry.
>>
>> When a PASID entry is active (P=1), other kernel paths might modify
>> the control bits in-place. For example:
>>
>> void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
>>                                            struct device *dev, u32 pasid)
>> {
>>          struct pasid_entry *pte;
>>          u16 did;
>>
>>          spin_lock(&iommu->lock);
>>          pte = intel_pasid_get_entry(dev, pasid);
>>          if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
>>                  spin_unlock(&iommu->lock);
>>                  return;
>>          }
>>
>>          pasid_set_pgsnp(pte);
>>          did = pasid_get_domain_id(pte);
>>          spin_unlock(&iommu->lock);
>>
>>          intel_pasid_flush_present(iommu, dev, pasid, did, pte);
>> }
>>
>> In this case, the iommu->lock ensures that if two threads try to modify
>> the same active entry, they don't interfere with each other and leave
>> the entry in a 'torn' state for the IOMMU hardware to read.
>>
>> In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
>> (setting P=0 and flushing caches), the entry is owned exclusively  by
>> the teardown thread until it is re-configured. That's the reason why the
>> final zeroing doesn't need the spinlock.
> 
> I see. Am I correct that those other code paths (modifying an entry
> in-place) are not supposed to do that concurrently with
> intel_pasid_tear_down_entry(), i.e. they should only do that while it is
> guaranteed that the entry remains present? Otherwise there is a bug
> (hence, for example, the WARN_ON in
> intel_pasid_setup_page_snoop_control())?

The iommu driver assumes that high-level software should ensure this.

> So, holding iommu->lock during
> entry teardown is not strictly necessary (i.e. we could unlock it even
> earlier than setting P=0), i.e. holding the lock until the entry is
> deactivated is basically just a safety measure for possible buggy code?

There are other paths that may be concurrent, such as the debugfs path
(dumping the pasid table through debugfs). Therefore, keeping iommu-
 >lock in the driver is neither redundant nor buggy.

Thanks,
baolu
Re: [PATCH 2/3] iommu/vt-d: Clear Present bit before tearing down PASID entry
Posted by Dmytro Maluka 2 weeks, 4 days ago
On Fri, Jan 16, 2026 at 02:06:30PM +0800, Baolu Lu wrote:
> On 1/16/26 05:35, Dmytro Maluka wrote:
> > On Thu, Jan 15, 2026 at 10:45:12AM +0800, Baolu Lu wrote:
> > > On 1/14/26 19:12, Dmytro Maluka wrote:
> > > > On Wed, Jan 14, 2026 at 01:38:13PM +0800, Baolu Lu wrote:
> > > > > On 1/14/26 03:34, Dmytro Maluka wrote:
> > > > > > On Tue, Jan 13, 2026 at 11:00:47AM +0800, Lu Baolu wrote:
> > > > > > > +	intel_pasid_clear_entry(iommu, dev, pasid, fault_ignore);
> > > > > > Is it safe to do this with iommu->lock already unlocked?
> > > > > 
> > > > > Yes, it is. The PASID entry lifecycle is serialized by the iommu_group-
> > > > > > mutex in the iommu core, which ensures that no other thread can attempt
> > > > > to allocate or setup this same PASID until intel_pasid_tear_down_entry()
> > > > > has returned.
> > > > > 
> > > > > The iommu->lock is held during the initial transition (P->0) to ensure
> > > > > atomicity against other hardware-table walkers, but once the P bit is
> > > > > cleared and the caches are flushed, the final zeroing of the 'dead'
> > > > > entry does not strictly require the spinlock because the PASID remains
> > > > > reserved in software until the function completes.
> > > > 
> > > > Ok. Just to understand: "other hardware-table walkers" means some
> > > > software walkers, not hardware ones? Which software walkers are those?
> > > > (I can't imagine how holding a spinlock could prevent the hardware from
> > > > walking those tables. :))
> > > 
> > > You are right. A spinlock doesn't stop the hardware. The spinlock
> > > serializes software threads to ensure the hardware walker always sees a
> > > consistent entry.
> > > 
> > > When a PASID entry is active (P=1), other kernel paths might modify
> > > the control bits in-place. For example:
> > > 
> > > void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
> > >                                            struct device *dev, u32 pasid)
> > > {
> > >          struct pasid_entry *pte;
> > >          u16 did;
> > > 
> > >          spin_lock(&iommu->lock);
> > >          pte = intel_pasid_get_entry(dev, pasid);
> > >          if (WARN_ON(!pte || !pasid_pte_is_present(pte))) {
> > >                  spin_unlock(&iommu->lock);
> > >                  return;
> > >          }
> > > 
> > >          pasid_set_pgsnp(pte);
> > >          did = pasid_get_domain_id(pte);
> > >          spin_unlock(&iommu->lock);
> > > 
> > >          intel_pasid_flush_present(iommu, dev, pasid, did, pte);
> > > }
> > > 
> > > In this case, the iommu->lock ensures that if two threads try to modify
> > > the same active entry, they don't interfere with each other and leave
> > > the entry in a 'torn' state for the IOMMU hardware to read.
> > > 
> > > In intel_pasid_tear_down_entry(), once the PASID entry is deactivated
> > > (setting P=0 and flushing caches), the entry is owned exclusively  by
> > > the teardown thread until it is re-configured. That's the reason why the
> > > final zeroing doesn't need the spinlock.
> > 
> > I see. Am I correct that those other code paths (modifying an entry
> > in-place) are not supposed to do that concurrently with
> > intel_pasid_tear_down_entry(), i.e. they should only do that while it is
> > guaranteed that the entry remains present? Otherwise there is a bug
> > (hence, for example, the WARN_ON in
> > intel_pasid_setup_page_snoop_control())?
> 
> The iommu driver assumes that high-level software should ensure this.
> 
> > So, holding iommu->lock during
> > entry teardown is not strictly necessary (i.e. we could unlock it even
> > earlier than setting P=0), i.e. holding the lock until the entry is
> > deactivated is basically just a safety measure for possible buggy code?
> 
> There are other paths that may be concurrent, such as the debugfs path
> (dumping the pasid table through debugfs). Therefore, keeping iommu-
> >lock in the driver is neither redundant nor buggy.

I see, that makes sense: clearing the present bit before iommu->lock is
unlocked should prevent such read-only walkers like debugfs from trying
to further walk down the path (i.e. to page tables) after iommu->lock is
unlocked.

> Thanks,
> baolu