Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
request when device is disconnected") relies on
pci_dev_is_disconnected() to skip ATS invalidation for
safely-removed devices, but it does not cover link-down caused
by faults, which can still hard-lock the system.
For example, if a VM fails to connect to the PCIe device,
"virsh destroy" is executed to release resources and isolate
the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput
Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a
ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the
paths below, which are far less frequent than memory map/unmap.
1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup
The gain in system stability far outweighs the negligible cost
of using pci_device_is_present() instead of pci_dev_is_disconnected()
to decide when to skip ATS invalidation, especially under GDR
high-load conditions.
Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/iommu/intel/pasid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index a369690f5926..e64d445de964 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -218,7 +218,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
-	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+	if (!pci_device_is_present(to_pci_dev(dev)))
 		return;
 
 	sid = PCI_DEVID(info->bus, info->devfn);
--
2.20.1
> From: Jinhui Guo <guojinhui.liam@bytedance.com>
> Sent: Thursday, December 11, 2025 12:00 PM
>
> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> request when device is disconnected") relies on
> pci_dev_is_disconnected() to skip ATS invalidation for
> safely-removed devices, but it does not cover link-down caused
> by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup
with surprise removal. For safe removal the device is not removed
before invalidation is done:
"
For safe removal, device wouldn't be removed until the whole software
handling process is done, it wouldn't trigger the hard lock up issue
caused by too long ATS Invalidation timeout wait.
"
Can you help articulate the problem especially about the part
'link-down caused by faults"? What are those faults? How are
they different from the said surprise removal in the commit
msg to not set pci_dev_is_disconnected()?
>
> For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?
> "virsh destroy" is executed to release resources and isolate
> the fault, but a hard-lockup occurs while releasing the group fd.
>
> Call Trace:
> qi_submit_sync
> qi_flush_dev_iotlb
> intel_pasid_tear_down_entry
> device_block_translation
> blocking_domain_attach_dev
> __iommu_attach_device
> __iommu_device_set_domain
> __iommu_group_set_domain_internal
> iommu_detach_group
> vfio_iommu_type1_detach_group
> vfio_group_detach_container
> vfio_group_fops_release
> __fput
>
> Although pci_device_is_present() is slower than
> pci_dev_is_disconnected(), it still takes only ~70 µs on a
> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> and width increase.
>
> Besides, devtlb_invalidation_with_pasid() is called only in the
> paths below, which are far less frequent than memory map/unmap.
>
> 1. mm-struct release
> 2. {attach,release}_dev
> 3. set/remove PASID
> 4. dirty-tracking setup
>
surprise removal can happen at any time, e.g. after the check of
pci_device_is_present(). In the end we need the logic in
qi_check_fault() to check the presence upon receiving an ITE timeout
error to break the infinite loop. So in your case even with
that logic in place you still observe the lockup (probably because
the hardware ITE timeout is longer than the lockup detection on
the CPU?)
In any case this change cannot 100% fix the lockup. It just
reduces the possibility which should be made clear.
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> > From: Jinhui Guo <guojinhui.liam@bytedance.com>
> > Sent: Thursday, December 11, 2025 12:00 PM
> >
> > Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> > request when device is disconnected") relies on
> > pci_dev_is_disconnected() to skip ATS invalidation for
> > safely-removed devices, but it does not cover link-down caused
> > by faults, which can still hard-lock the system.
>
> According to the commit msg it actually tries to fix the hard lockup
> with surprise removal. For safe removal the device is not removed
> before invalidation is done:
>
> "
> For safe removal, device wouldn't be removed until the whole software
> handling process is done, it wouldn't trigger the hard lock up issue
> caused by too long ATS Invalidation timeout wait.
> "
>
> Can you help articulate the problem especially about the part
> 'link-down caused by faults"? What are those faults? How are
> they different from the said surprise removal in the commit
> msg to not set pci_dev_is_disconnected()?
>
Hi, Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally
triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the
device suddenly loses its link; the behaviour is identical to setting the
Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
unknown.
> >
> > For example, if a VM fails to connect to the PCIe device,
>
> 'failed' for what reason?
>
> > "virsh destroy" is executed to release resources and isolate
> > the fault, but a hard-lockup occurs while releasing the group fd.
> >
> > Call Trace:
> > qi_submit_sync
> > qi_flush_dev_iotlb
> > intel_pasid_tear_down_entry
> > device_block_translation
> > blocking_domain_attach_dev
> > __iommu_attach_device
> > __iommu_device_set_domain
> > __iommu_group_set_domain_internal
> > iommu_detach_group
> > vfio_iommu_type1_detach_group
> > vfio_group_detach_container
> > vfio_group_fops_release
> > __fput
> >
> > Although pci_device_is_present() is slower than
> > pci_dev_is_disconnected(), it still takes only ~70 µs on a
> > ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
> > and width increase.
> >
> > Besides, devtlb_invalidation_with_pasid() is called only in the
> > paths below, which are far less frequent than memory map/unmap.
> >
> > 1. mm-struct release
> > 2. {attach,release}_dev
> > 3. set/remove PASID
> > 4. dirty-tracking setup
> >
>
> > surprise removal can happen at any time, e.g. after the check of
> > pci_device_is_present(). In the end we need the logic in
> > qi_check_fault() to check the presence upon receiving an ITE timeout
> > error to break the infinite loop. So in your case even with
> > that logic in place you still observe the lockup (probably because
> > the hardware ITE timeout is longer than the lockup detection on
> > the CPU?)
Are you referring to the timeout added in patch
https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?
Our lockup-detection timeout is the default 10 s.
We see ITE-timeout messages in the kernel log. Yet the system still
hard-locks—probably because, as you mentioned, the hardware ITE timeout
is longer than the CPU’s lockup-detection window. I’ll reproduce the
case and follow up with a deeper analysis.
kernel: [ 2402.642685][ T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][ C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][ C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][ C7] rcu: 8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][ C7] rcu: (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][ C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][ C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S E 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][ C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][ C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][ C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][ C8] PKRU: 55555554
kernel: [ 2423.643611][ C8] Call Trace:
kernel: [ 2423.643613][ C8] <TASK>
kernel: [ 2423.643616][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][ C8] device_block_translation+0x122/0x180
kernel: [ 2423.643634][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][ C8] __fput+0xe6/0x2b0
kernel: [ 2423.643682][ C8] task_work_run+0x58/0x90
kernel: [ 2423.643688][ C8] do_exit+0x29b/0xa80
kernel: [ 2423.643694][ C8] do_group_exit+0x2c/0x80
kernel: [ 2423.643696][ C8] get_signal+0x8f9/0x900
kernel: [ 2423.643700][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][ C8] ? __schedule+0x582/0xe80
kernel: [ 2423.643708][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][ C8] do_syscall_64+0x262/0x630
kernel: [ 2423.643717][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][ C8] </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][ C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][ C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][ C8] ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S EL 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][ C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][ C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][ C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][ C8] PKRU: 55555554
kernel: [ 2448.327993][ C8] Call Trace:
kernel: [ 2448.327995][ C8] <TASK>
kernel: [ 2448.327997][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][ C8] device_block_translation+0x122/0x180
kernel: [ 2448.328012][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][ C8] __fput+0xe6/0x2b0
kernel: [ 2448.328049][ C8] task_work_run+0x58/0x90
kernel: [ 2448.328053][ C8] do_exit+0x29b/0xa80
kernel: [ 2448.328057][ C8] do_group_exit+0x2c/0x80
kernel: [ 2448.328060][ C8] get_signal+0x8f9/0x900
kernel: [ 2448.328064][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][ C8] ? __schedule+0x582/0xe80
kernel: [ 2448.328070][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][ C8] do_syscall_64+0x262/0x630
kernel: [ 2448.328076][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][ C8] </TASK>
kernel: [ 2450.245901][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]
>
> In any case this change cannot 100% fix the lockup. It just
> reduces the possibility which should be made clear.
I agree with the above, but it's better to cover more corner cases.
Best Regards,
Jinhui
On 12/22/25 19:19, Jinhui Guo wrote:
> On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
>>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
>>> Sent: Thursday, December 11, 2025 12:00 PM
>>>
>>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
>>> request when device is disconnected") relies on
>>> pci_dev_is_disconnected() to skip ATS invalidation for
>>> safely-removed devices, but it does not cover link-down caused
>>> by faults, which can still hard-lock the system.
>> According to the commit msg it actually tries to fix the hard lockup
>> with surprise removal. For safe removal the device is not removed
>> before invalidation is done:
>>
>> "
>> For safe removal, device wouldn't be removed until the whole software
>> handling process is done, it wouldn't trigger the hard lock up issue
>> caused by too long ATS Invalidation timeout wait.
>> "
>>
>> Can you help articulate the problem especially about the part
>> 'link-down caused by faults"? What are those faults? How are
>> they different from the said surprise removal in the commit
>> msg to not set pci_dev_is_disconnected()?
>>
> Hi, Kevin, sorry for the delayed reply.
>
> A normal or surprise removal of a PCIe device on a hot-plug port normally
> triggers an interrupt from the PCIe switch.
>
> We have, however, observed cases where no interrupt is generated when the
> device suddenly loses its link; the behaviour is identical to setting the
> Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
> what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> unknown.
In this scenario, the hardware has effectively vanished, yet the device
driver remains bound and the IOMMU resources haven't been released. I’m
just curious if this stale state could trigger issues in other places
before the kernel fully realizes the device is gone? I’m not objecting
to the fix. I'm just interested in whether this 'zombie' state creates
risks elsewhere.
>
>>> For example, if a VM fails to connect to the PCIe device,
>> 'failed' for what reason?
>>
>>> "virsh destroy" is executed to release resources and isolate
>>> the fault, but a hard-lockup occurs while releasing the group fd.
>>>
>>> Call Trace:
>>> qi_submit_sync
>>> qi_flush_dev_iotlb
>>> intel_pasid_tear_down_entry
>>> device_block_translation
>>> blocking_domain_attach_dev
>>> __iommu_attach_device
>>> __iommu_device_set_domain
>>> __iommu_group_set_domain_internal
>>> iommu_detach_group
>>> vfio_iommu_type1_detach_group
>>> vfio_group_detach_container
>>> vfio_group_fops_release
>>> __fput
>>>
>>> Although pci_device_is_present() is slower than
>>> pci_dev_is_disconnected(), it still takes only ~70 µs on a
>>> ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
>>> and width increase.
>>>
>>> Besides, devtlb_invalidation_with_pasid() is called only in the
>>> paths below, which are far less frequent than memory map/unmap.
>>>
>>> 1. mm-struct release
>>> 2. {attach,release}_dev
>>> 3. set/remove PASID
>>> 4. dirty-tracking setup
>>>
>> surprise removal can happen at any time, e.g. after the check of
>> pci_device_is_present(). In the end we need the logic in
>> qi_check_fault() to check the presence upon receiving an ITE timeout
>> error to break the infinite loop. So in your case even with
>> that logic in place you still observe the lockup (probably because
>> the hardware ITE timeout is longer than the lockup detection on
>> the CPU?)
> Are you referring to the timeout added in patch
> https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?
This doesn't appear to be a deterministic solution, because ...
> Our lockup-detection timeout is the default 10 s.
>
> We see ITE-timeout messages in the kernel log. Yet the system still
> hard-locks—probably because, as you mentioned, the hardware ITE timeout
> is longer than the CPU’s lockup-detection window. I’ll reproduce the
> case and follow up with a deeper analysis.
... as you see, neither the PCI nor the VT-d specifications mandate a
specific device-TLB invalidation timeout value for hardware
implementations. Consequently, the ITE timeout value may exceed the CPU
watchdog threshold, meaning a hard lockup will be detected before the
ITE even occurs.
Thanks,
baolu
+Bjorn for guidance.
quick context - the intel-iommu driver previously fixed a lockup issue on surprise
removal by checking pci_dev_is_disconnected(). But Jinhui still observed the
lockup in a setup where no interrupt is raised to the PCI core upon surprise
removal (so pci_dev_is_disconnected() is false), hence the suggestion to replace
the check with pci_device_is_present() instead.
Bjorn, is it common practice to fix this directly/only in drivers, or should the
PCI core be notified, e.g. by simulating a late removal event? From searching the
code it looks like it's the former, but better to confirm with you before picking
up this fix...
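For reference, the practical difference between the two checks: pci_dev_is_disconnected()
only looks at software state that the PCI core sets once it has learned about the
removal, while pci_device_is_present() issues a config read of the Vendor ID and so
also notices a dead link for which no removal event was ever delivered. A simplified
sketch, approximating (not copying) the upstream helpers:

/* Simplified, approximate sketch of the two checks being compared. */
static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
{
	/* Software-only state; stays false if no removal event was seen. */
	return dev->error_state == pci_channel_io_perm_failure;
}

bool pci_device_is_present(struct pci_dev *pdev)
{
	u32 v;

	if (pci_dev_is_disconnected(pdev))
		return false;
	/*
	 * Config read of the Vendor ID; if the device no longer responds
	 * (all-ones on the bus), report it as absent.
	 */
	return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
}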
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Tuesday, December 23, 2025 12:06 PM
>
> On 12/22/25 19:19, Jinhui Guo wrote:
> > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
> >>> Sent: Thursday, December 11, 2025 12:00 PM
> >>>
> >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> >>> request when device is disconnected") relies on
> >>> pci_dev_is_disconnected() to skip ATS invalidation for
> >>> safely-removed devices, but it does not cover link-down caused
> >>> by faults, which can still hard-lock the system.
> >> According to the commit msg it actually tries to fix the hard lockup
> >> with surprise removal. For safe removal the device is not removed
> >> before invalidation is done:
> >>
> >> "
> >> For safe removal, device wouldn't be removed until the whole software
> >> handling process is done, it wouldn't trigger the hard lock up issue
> >> caused by too long ATS Invalidation timeout wait.
> >> "
> >>
> >> Can you help articulate the problem especially about the part
> >> 'link-down caused by faults"? What are those faults? How are
> >> they different from the said surprise removal in the commit
> >> msg to not set pci_dev_is_disconnected()?
> >>
> > Hi, Kevin, sorry for the delayed reply.
> >
> > A normal or surprise removal of a PCIe device on a hot-plug port normally
> > triggers an interrupt from the PCIe switch.
> >
> > We have, however, observed cases where no interrupt is generated when
> the
> > device suddenly loses its link; the behaviour is identical to setting the
> > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
> > what goes wrong in the LTSSM between the PCIe switch and the endpoint
> remains
> > unknown.
>
> In this scenario, the hardware has effectively vanished, yet the device
> driver remains bound and the IOMMU resources haven't been released. I’m
> just curious if this stale state could trigger issues in other places
> before the kernel fully realizes the device is gone? I’m not objecting
> to the fix. I'm just interested in whether this 'zombie' state creates
> risks elsewhere.
>
On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
> > On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
> >>> From: Jinhui Guo<guojinhui.liam@bytedance.com>
> >>> Sent: Thursday, December 11, 2025 12:00 PM
> >>>
> >>> Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
> >>> request when device is disconnected") relies on
> >>> pci_dev_is_disconnected() to skip ATS invalidation for
> >>> safely-removed devices, but it does not cover link-down caused
> >>> by faults, which can still hard-lock the system.
> >> According to the commit msg it actually tries to fix the hard lockup
> >> with surprise removal. For safe removal the device is not removed
> >> before invalidation is done:
> >>
> >> "
> >> For safe removal, device wouldn't be removed until the whole software
> >> handling process is done, it wouldn't trigger the hard lock up issue
> >> caused by too long ATS Invalidation timeout wait.
> >> "
> >>
> >> Can you help articulate the problem especially about the part
> >> 'link-down caused by faults"? What are those faults? How are
> >> they different from the said surprise removal in the commit
> >> msg to not set pci_dev_is_disconnected()?
> >>
> > Hi, Kevin, sorry for the delayed reply.
> >
> > A normal or surprise removal of a PCIe device on a hot-plug port normally
> > triggers an interrupt from the PCIe switch.
> >
> > We have, however, observed cases where no interrupt is generated when the
> > device suddenly loses its link; the behaviour is identical to setting the
> > Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
> > what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
> > unknown.
>
> In this scenario, the hardware has effectively vanished, yet the device
> driver remains bound and the IOMMU resources haven't been released. I’m
> just curious if this stale state could trigger issues in other places
> before the kernel fully realizes the device is gone? I’m not objecting
> to the fix. I'm just interested in whether this 'zombie' state creates
> risks elsewhere.
Hi, Baolu
In our scenario we see no other issues; a hard-lockup panic is triggered the
moment the Mellanox Ethernet device vanishes. But we can analyze what happens
when we access the Mellanox Ethernet device whose link is disabled.
(If we check whether the PCIe endpoint device (Mellanox Ethernet) is present
before issuing device-IOTLB invalidation to the Intel IOMMU, no other issues
appear.)
According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two kinds of
TLPs: posted and non-posted. Non-posted TLPs require a completion TLP; posted
TLPs do not.
- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request, or a
Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request, an
I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.
When the CPU issues a PCIe memory-write TLP (posted) via a MOV instruction,
the instruction retires as soon as the write has been posted toward the Root
Complex; the CPU does not wait for any completion. A memory-read TLP
(non-posted), however, stalls the core until the corresponding Completion TLP
is received - if that Completion never arrives, the CPU hangs. (The CPU hangs
if the LTSSM does not enter the Disabled state.)
However, if the LTSSM does enter the Disabled state, the Root Port returns a
Completer-Abort (CA) completion for any non-posted TLP, so the read completes
with all-ones data (0xFFFFFFFF) instead of stalling.
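As a side note, this all-ones behaviour is what lets kernel code detect a dead
link cheaply. A minimal, hypothetical sketch (the helper name and register
offset below are made up for illustration, not taken from any real driver):

/*
 * Hypothetical illustration only: detect a dead link by reading an MMIO
 * register whose legitimate value can never be all-ones.
 */
#include <linux/io.h>
#include <linux/types.h>

static bool example_link_alive(void __iomem *bar)
{
	u32 val = readl(bar);	/* hypothetical register at offset 0 */

	/* A link-down port completes the read with all-ones data. */
	return val != 0xffffffff;
}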
I ran some tests on the machine after setting the Link Disable bit in the switch’s
Link Control register (offset 10h).
- setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010
+-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
| | +-04.0-[3e]----
| | \-08.0-[3f]----00.0 Mellanox Technologies MT27800 Family [ConnectX-5]
# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
...
Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
...
1) Issuing a PCI config-space read request returns 0xFFFFFFFF.
# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.
# ./devmem
Usage: ./devmem <phys_addr> <size> <offset> [value]
phys_addr : physical base address of the BAR (hex or decimal)
size : mapping length in bytes (hex or decimal)
offset : register offset from BAR base (hex or decimal)
value : optional 32-bit value to write (hex or decimal)
Example: ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF
# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0xffffffff
Before the link was disabled, we could read 0x3af804000000 with devmem and
obtain a valid result.
# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0x10002300
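For completeness, a minimal user-space sketch of such a devmem-style reader (an
illustration of the read path only, not the exact tool used above; it maps the
BAR through /dev/mem, so it needs root and an unrestricted /dev/mem for that range):

/*
 * Minimal sketch: map a physical BAR region via /dev/mem and read one
 * 32-bit register. Read path only; arguments follow the usage text
 * shown above.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 4) {
		fprintf(stderr, "Usage: %s <phys_addr> <size> <offset>\n", argv[0]);
		return 1;
	}

	uint64_t phys = strtoull(argv[1], NULL, 0);
	size_t size = strtoull(argv[2], NULL, 0);
	uint64_t off = strtoull(argv[3], NULL, 0);

	int fd = open("/dev/mem", O_RDONLY | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	void *map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, (off_t)phys);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* A read from a link-down device is expected to return all-ones. */
	uint32_t val = *(volatile uint32_t *)((uint8_t *)map + off);
	printf("0x%" PRIx64 " = 0x%08" PRIx32 "\n", phys + off, val);

	munmap(map, size);
	close(fd);
	return 0;
}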
Besides, after searching the kernel code, I found that many EP drivers already
check whether their endpoint is still present. There may be exceptions in some
PCIe endpoint drivers, such as commit 43bb40c5b926 ("virtio_pci: Support surprise
removal of virtio pci device").
Best Regards,
Jinhui