[PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry

Guanghui Feng posted 1 patch 5 days, 1 hour ago
drivers/iommu/intel/dmar.c | 18 +++---------------
1 file changed, 3 insertions(+), 15 deletions(-)
[PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry
Posted by Guanghui Feng 5 days, 1 hour ago
Device-TLB Invalidation Response Time-out (ITE) handling was added in
commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695.

When an ITE occurs, iommu will sets the ITE (Invalidation Time-out
Error) field in the Fault Status Register. No new descriptors are
fetched from the Invalidation Queue until software clears the ITE field
in the Fault Status Register. Tail pointer Register updates by software
while the ITE field is Set does not cause descriptor fetches by
hardware. At the time ITE field is Set, hardware aborts any
inv_wait_dsc commands pending in hardware and does not increment
the Invalidation Queue Head register. When software clears the
ITE field in the Fault Status Register, hardware fetches
descriptor pointed by the Invalidation Queue Head register.

But in the qi_check_fault process, it is implemented by default
according to the 2009 commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695,
that is, only one struct qi_desc is submitted at a time. A qi_desc request is
immediately followed by a wait_desc/QI_IWD_TYPE for
synchronization. Therefore, the IOMMU driver implementation
considers invalid queue entries at odd positions to be
wait_desc. After ITE is set, hardware aborts any pending
inv_wait_dsc commands in hardware. Therefore, qi_check_fault
iterates through odd-position as wait_desc entries and sets
desc_status to QI_ABORT. However, the current implementation
allows multiple struct qi_desc to be submitted simultaneously,
followed by one wait_desc, so it's no longer guaranteed that
odd-position entries will be wait_desc. When the number of submitted
struct qi_desc is even, wait_desc's desc_status will not be set to QI_ABORT,
qi_check_fault will return 0, and qi_submit_sync will then
execute in an infinite loop and cause a hard lockup when
interrupts are disabled and the PCIe device does not respond to
Device-TLB Invalidation requests.

Additionally, if the device remains online and an IOMMU ITE
occurs, simply returning -EAGAIN is sufficient. When processing
the -EAGAIN result, qi_submit_sync will automatically reclaim
all submitted struct qi_desc and resubmit the requests.

Through this modification:
1. Correctly triggers the resubmission of struct qi_desc when
an ITE occurs.
2. Prevents the IOMMU driver from disabling interrupts and
executing in an infinite loop within qi_submit_sync when an
ITE occurs, avoiding hardlockup.

Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
---
 drivers/iommu/intel/dmar.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index ec975c73cfe6..f31f0095f9a8 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1271,7 +1271,7 @@ static void qi_dump_fault(struct intel_iommu *iommu, u32 fault)
 static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
 {
 	u32 fault;
-	int head, tail;
+	int head;
 	struct device *dev;
 	u64 iqe_err, ite_sid;
 	struct q_inval *qi = iommu->qi;
@@ -1312,12 +1312,6 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
 	 * No new descriptors are fetched until the ITE is cleared.
 	 */
 	if (fault & DMA_FSTS_ITE) {
-		head = readl(iommu->reg + DMAR_IQH_REG);
-		head = ((head >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
-		head |= 1;
-		tail = readl(iommu->reg + DMAR_IQT_REG);
-		tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
-
 		/*
 		 * SID field is valid only when the ITE field is Set in FSTS_REG
 		 * see Intel VT-d spec r4.1, section 11.4.9.9
@@ -1328,12 +1322,6 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
 		writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
 		pr_info("Invalidation Time-out Error (ITE) cleared\n");
 
-		do {
-			if (qi->desc_status[head] == QI_IN_USE)
-				qi->desc_status[head] = QI_ABORT;
-			head = (head - 2 + QI_LENGTH) % QI_LENGTH;
-		} while (head != tail);
-
 		/*
 		 * If device was released or isn't present, no need to retry
 		 * the ATS invalidate request anymore.
@@ -1347,8 +1335,8 @@ static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
 			    !pci_device_is_present(to_pci_dev(dev)))
 				return -ETIMEDOUT;
 		}
-		if (qi->desc_status[wait_index] == QI_ABORT)
-			return -EAGAIN;
+
+		return -EAGAIN;
 	}
 
 	if (fault & DMA_FSTS_ICE) {
-- 
2.43.7
Re: [PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry
Posted by Baolu Lu 2 days, 17 hours ago
On 2/2/2026 10:09 AM, Guanghui Feng wrote:
> Device-TLB Invalidation Response Time-out (ITE) handling was added in
> commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695.
> 
> When an ITE occurs, iommu will sets the ITE (Invalidation Time-out
> Error) field in the Fault Status Register. No new descriptors are
> fetched from the Invalidation Queue until software clears the ITE field
> in the Fault Status Register. Tail pointer Register updates by software
> while the ITE field is Set does not cause descriptor fetches by
> hardware. At the time ITE field is Set, hardware aborts any
> inv_wait_dsc commands pending in hardware and does not increment
> the Invalidation Queue Head register. When software clears the
> ITE field in the Fault Status Register, hardware fetches
> descriptor pointed by the Invalidation Queue Head register.
> 
> But in the qi_check_fault process, it is implemented by default
> according to the 2009 commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695,
> that is, only one struct qi_desc is submitted at a time. A qi_desc request is
> immediately followed by a wait_desc/QI_IWD_TYPE for
> synchronization. Therefore, the IOMMU driver implementation
> considers invalid queue entries at odd positions to be
> wait_desc. After ITE is set, hardware aborts any pending
> inv_wait_dsc commands in hardware. Therefore, qi_check_fault
> iterates through odd-position as wait_desc entries and sets
> desc_status to QI_ABORT. However, the current implementation
> allows multiple struct qi_desc to be submitted simultaneously,
> followed by one wait_desc, so it's no longer guaranteed that
> odd-position entries will be wait_desc. When the number of submitted
> struct qi_desc is even, wait_desc's desc_status will not be set to QI_ABORT,
> qi_check_fault will return 0, and qi_submit_sync will then
> execute in an infinite loop and cause a hard lockup when
> interrupts are disabled and the PCIe device does not respond to
> Device-TLB Invalidation requests.

Yes. This appears a real software bug.

> 
> Additionally, if the device remains online and an IOMMU ITE
> occurs, simply returning -EAGAIN is sufficient. When processing
> the -EAGAIN result, qi_submit_sync will automatically reclaim
> all submitted struct qi_desc and resubmit the requests.
> 
> Through this modification:
> 1. Correctly triggers the resubmission of struct qi_desc when
> an ITE occurs.
> 2. Prevents the IOMMU driver from disabling interrupts and
> executing in an infinite loop within qi_submit_sync when an
> ITE occurs, avoiding hardlockup.

But I think this fix changes the behavior of the driver.

Previously, when an ITE error was detected, it cleared the ITE so that
hardware could keep going, aborted all wait-descriptors that were being
handled by hardware, and returned -EAGAIN if its own wait-descriptor was
impacted.

This patch changes the behavior; it returns -EAGAIN directly whenever it
detects an ITE error, regardless of whether its wait-desc is impacted.
In the single-threaded case, it works as expected, but race condition
might occur when qi_submit_sync() is called in multiple threads at the
same time.

> 
> Signed-off-by: Guanghui Feng<guanghuifeng@linux.alibaba.com>
> ---
>   drivers/iommu/intel/dmar.c | 18 +++---------------
>   1 file changed, 3 insertions(+), 15 deletions(-)

Have you tried to fix it by dropping the "odd position" assumption? For
example, removing "head |= 1" and decrementing by 1 instead of 2 in the
loop?

      do {
              if (qi->desc_status[head] == QI_IN_USE)
                      qi->desc_status[head] = QI_ABORT;
              head = (head - 2 + QI_LENGTH) % QI_LENGTH;
      } while (head != tail);

Thanks,
baolu
Re: [PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry
Posted by guanghuifeng@linux.alibaba.com 1 day, 16 hours ago
在 2026/2/4 17:32, Baolu Lu 写道:
> On 2/2/2026 10:09 AM, Guanghui Feng wrote:
>> Device-TLB Invalidation Response Time-out (ITE) handling was added in
>> commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695.
>>
>> When an ITE occurs, iommu will sets the ITE (Invalidation Time-out
>> Error) field in the Fault Status Register. No new descriptors are
>> fetched from the Invalidation Queue until software clears the ITE field
>> in the Fault Status Register. Tail pointer Register updates by software
>> while the ITE field is Set does not cause descriptor fetches by
>> hardware. At the time ITE field is Set, hardware aborts any
>> inv_wait_dsc commands pending in hardware and does not increment
>> the Invalidation Queue Head register. When software clears the
>> ITE field in the Fault Status Register, hardware fetches
>> descriptor pointed by the Invalidation Queue Head register.
>>
>> But in the qi_check_fault process, it is implemented by default
>> according to the 2009 commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695,
>> that is, only one struct qi_desc is submitted at a time. A qi_desc 
>> request is
>> immediately followed by a wait_desc/QI_IWD_TYPE for
>> synchronization. Therefore, the IOMMU driver implementation
>> considers invalid queue entries at odd positions to be
>> wait_desc. After ITE is set, hardware aborts any pending
>> inv_wait_dsc commands in hardware. Therefore, qi_check_fault
>> iterates through odd-position as wait_desc entries and sets
>> desc_status to QI_ABORT. However, the current implementation
>> allows multiple struct qi_desc to be submitted simultaneously,
>> followed by one wait_desc, so it's no longer guaranteed that
>> odd-position entries will be wait_desc. When the number of submitted
>> struct qi_desc is even, wait_desc's desc_status will not be set to 
>> QI_ABORT,
>> qi_check_fault will return 0, and qi_submit_sync will then
>> execute in an infinite loop and cause a hard lockup when
>> interrupts are disabled and the PCIe device does not respond to
>> Device-TLB Invalidation requests.
>
> Yes. This appears a real software bug.
>
>>
>> Additionally, if the device remains online and an IOMMU ITE
>> occurs, simply returning -EAGAIN is sufficient. When processing
>> the -EAGAIN result, qi_submit_sync will automatically reclaim
>> all submitted struct qi_desc and resubmit the requests.
>>
>> Through this modification:
>> 1. Correctly triggers the resubmission of struct qi_desc when
>> an ITE occurs.
>> 2. Prevents the IOMMU driver from disabling interrupts and
>> executing in an infinite loop within qi_submit_sync when an
>> ITE occurs, avoiding hardlockup.
>
> But I think this fix changes the behavior of the driver.
>
> Previously, when an ITE error was detected, it cleared the ITE so that
> hardware could keep going, aborted all wait-descriptors that were being
> handled by hardware, and returned -EAGAIN if its own wait-descriptor was
> impacted.
>
> This patch changes the behavior; it returns -EAGAIN directly whenever it
> detects an ITE error, regardless of whether its wait-desc is impacted.
> In the single-threaded case, it works as expected, but race condition
> might occur when qi_submit_sync() is called in multiple threads at the
> same time.
>
>>
>> Signed-off-by: Guanghui Feng<guanghuifeng@linux.alibaba.com>
>> ---
>>   drivers/iommu/intel/dmar.c | 18 +++---------------
>>   1 file changed, 3 insertions(+), 15 deletions(-)
>
> Have you tried to fix it by dropping the "odd position" assumption? For
> example, removing "head |= 1" and decrementing by 1 instead of 2 in the
> loop?
>
>      do {
>              if (qi->desc_status[head] == QI_IN_USE)
>                      qi->desc_status[head] = QI_ABORT;
>              head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>      } while (head != tail);
>
> Thanks,
> baolu

Thank you for your reply.

There are a few points that need clarification:
The descriptors between head and tail are requests that have not been 
fetched and executed.


Regarding the requests before the head:
Method 1: Does the IOMMU update the head address register immediately 
after fetching the descriptor?
Method 2: Or does the IOMMU update the head register only after fetching 
and executing the request?

The current Intel IOMMU VT-d specification does not describe this 
behavior in detail.
Does the IOMMU currently use Method 1?

Therefore, after an ITE timeout, it's necessary to resend the requests 
before the head index


Re: [PATCH] iommu/vt-d: fix intel iommu iotlb sync hardlockup & retry
Posted by Baolu Lu 1 day ago
On 2/5/26 18:28, guanghuifeng@linux.alibaba.com wrote:
> 
> 在 2026/2/4 17:32, Baolu Lu 写道:
>> On 2/2/2026 10:09 AM, Guanghui Feng wrote:
>>> Device-TLB Invalidation Response Time-out (ITE) handling was added in
>>> commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695.
>>>
>>> When an ITE occurs, iommu will sets the ITE (Invalidation Time-out
>>> Error) field in the Fault Status Register. No new descriptors are
>>> fetched from the Invalidation Queue until software clears the ITE field
>>> in the Fault Status Register. Tail pointer Register updates by software
>>> while the ITE field is Set does not cause descriptor fetches by
>>> hardware. At the time ITE field is Set, hardware aborts any
>>> inv_wait_dsc commands pending in hardware and does not increment
>>> the Invalidation Queue Head register. When software clears the
>>> ITE field in the Fault Status Register, hardware fetches
>>> descriptor pointed by the Invalidation Queue Head register.
>>>
>>> But in the qi_check_fault process, it is implemented by default
>>> according to the 2009 commit: 6ba6c3a4cacfd68bf970e3e04e2ff0d66fa0f695,
>>> that is, only one struct qi_desc is submitted at a time. A qi_desc 
>>> request is
>>> immediately followed by a wait_desc/QI_IWD_TYPE for
>>> synchronization. Therefore, the IOMMU driver implementation
>>> considers invalid queue entries at odd positions to be
>>> wait_desc. After ITE is set, hardware aborts any pending
>>> inv_wait_dsc commands in hardware. Therefore, qi_check_fault
>>> iterates through odd-position as wait_desc entries and sets
>>> desc_status to QI_ABORT. However, the current implementation
>>> allows multiple struct qi_desc to be submitted simultaneously,
>>> followed by one wait_desc, so it's no longer guaranteed that
>>> odd-position entries will be wait_desc. When the number of submitted
>>> struct qi_desc is even, wait_desc's desc_status will not be set to 
>>> QI_ABORT,
>>> qi_check_fault will return 0, and qi_submit_sync will then
>>> execute in an infinite loop and cause a hard lockup when
>>> interrupts are disabled and the PCIe device does not respond to
>>> Device-TLB Invalidation requests.
>>
>> Yes. This appears a real software bug.
>>
>>>
>>> Additionally, if the device remains online and an IOMMU ITE
>>> occurs, simply returning -EAGAIN is sufficient. When processing
>>> the -EAGAIN result, qi_submit_sync will automatically reclaim
>>> all submitted struct qi_desc and resubmit the requests.
>>>
>>> Through this modification:
>>> 1. Correctly triggers the resubmission of struct qi_desc when
>>> an ITE occurs.
>>> 2. Prevents the IOMMU driver from disabling interrupts and
>>> executing in an infinite loop within qi_submit_sync when an
>>> ITE occurs, avoiding hardlockup.
>>
>> But I think this fix changes the behavior of the driver.
>>
>> Previously, when an ITE error was detected, it cleared the ITE so that
>> hardware could keep going, aborted all wait-descriptors that were being
>> handled by hardware, and returned -EAGAIN if its own wait-descriptor was
>> impacted.
>>
>> This patch changes the behavior; it returns -EAGAIN directly whenever it
>> detects an ITE error, regardless of whether its wait-desc is impacted.
>> In the single-threaded case, it works as expected, but race condition
>> might occur when qi_submit_sync() is called in multiple threads at the
>> same time.
>>
>>>
>>> Signed-off-by: Guanghui Feng<guanghuifeng@linux.alibaba.com>
>>> ---
>>>   drivers/iommu/intel/dmar.c | 18 +++---------------
>>>   1 file changed, 3 insertions(+), 15 deletions(-)
>>
>> Have you tried to fix it by dropping the "odd position" assumption? For
>> example, removing "head |= 1" and decrementing by 1 instead of 2 in the
>> loop?
>>
>>      do {
>>              if (qi->desc_status[head] == QI_IN_USE)
>>                      qi->desc_status[head] = QI_ABORT;
>>              head = (head - 2 + QI_LENGTH) % QI_LENGTH;
>>      } while (head != tail);
>>
>> Thanks,
>> baolu
> 
> Thank you for your reply.
> 
> There are a few points that need clarification:
> The descriptors between head and tail are requests that have not been 
> fetched and executed.
> 
> 
> Regarding the requests before the head:
> Method 1: Does the IOMMU update the head address register immediately 
> after fetching the descriptor?
> Method 2: Or does the IOMMU update the head register only after fetching 
> and executing the request?
> 
> The current Intel IOMMU VT-d specification does not describe this 
> behavior in detail.
> Does the IOMMU currently use Method 1?
> 
> Therefore, after an ITE timeout, it's necessary to resend the requests 
> before the head index

An obvious race that I can think of is something like this:

Thread A placed a dev-tlb-inv-desc in the invalidation queue. After
that, thread B placed an iotlb-inv-desc in the queue. Now the requests
in the queue look like this:

     dev-tlb-inv-desc for A
     iotlb-inv-desc for B

Then a device TLB invalidation timeout error happens and triggers the
ITE bit to be set in the fault register. Thread B sees this in its
qi_check_fault(), clears the ITE bit, and returns -EAGAIN. Then thread A
will loop infinitely waiting for DONE in its wait-desc.

The qi_submit_sync() logic has been there for years. Changing its
behavior without enough validation on real hardware will cause
unexpected issues. The better approach I would suggest is to fix the
outdated logic.

Thanks,
baolu