[PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout

Posted by Nicolin Chen 5 days, 23 hours ago
Hi all,

This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.

[ As Jason pointed out, because this series fundamentally introduces a new
  RAS feature to quarantine and recover from hardware faults and relies on
  a recently accepted SMMU driver rework, it is not treated as a standard
  bug fix. Thus, most of the patches here don't carry a "Fixes" tag. ]

Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
    ...

To resolve this, introduce a mechanism to quarantine a broken device in the
SMMUv3 driver and the IOMMU core, built on these preparatory changes:
 - Pass in PCI reset result to pci_dev_reset_iommu_done()
 - Co-clear pending CMDQ_ERR from the cmdq issuer under a raw_spinlock_t,
   so an ATC_INV timeout flagged in cmdq->atc_sync_timeouts is definitive
   when the issuer reads its bit after CMD_SYNC poll
 - Introduce a reset_device_done op, allowing the core to signal the driver
   when the physical hardware has been cleanly recovered (e.g., via AER or
   a manual reset) so the quarantine can be lifted
 - Utilize a per-group_device WQ via an iommu_report_device_broken() helper

On the SMMUv3 driver side, retry the timed-out ATC_INV batch to identify the
faulty device(s). Perform a surgical STE update, and flag the ATS as broken
to reject further ATS/ATC requests at HW level and suppress timeout spam.

This is on GitHub:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v4

Changelog
v4:
 * Rebase on Joerg's IOMMU "fixes" branch
 * Rebase on Jason's SMMUv3 cmd_ent series
   https://lore.kernel.org/all/0-v2-47b2bf710ad5+716ac-smmu_no_cmdq_ent_jgg@nvidia.com/
 * [PCI] Don't suspend IOMMU in probe mode
 * [iommu] kfree_rcu() iommu_group
 * [iommu] Convert gdev->blocked to enum gdev_blocked
 * [iommu] Use disable_work_sync() to fix UAF and ref leak
 * [iommu] Gate done() transitions to preserve BLOCKED_BROKEN
 * [iommu] Decrement recovery_cnt when unplugging a blocked gdev
 * [iommu] Drop racy dev_has_iommu() in iommu_report_device_broken()
 * [iommu] Add gdev->broken_pending to skip worker after racing recovery
 * [smmuv3] Add master->ats_invs scratch
 * [smmuv3] Add arm_smmu_cmdq_batch_issue() wrapper
 * [smmuv3] Force per-flush sync for has_ats batches
 * [smmuv3] Serialize STE.EATS and ats_broken updates
 * [smmuv3] Co-clear pending CMDQ_ERR from cmdq issuer
 * [smmuv3] Add invs and has_ats to arm_smmu_cmdq_batch
 * [smmuv3] Move arm_smmu_invs_for_each_entry to header
 * [smmuv3] Set master->ats_broken after clearing STE.EATS
 * [smmuv3] Issue CFGI_STE via arm_smmu_cmdq_issue_cmd_with_sync()
 * [smmuv3] Keep "smmu" pointer in arm_smmu_inv but add "master" for ATS
v3:
 https://lore.kernel.org/all/cover.1776381841.git.nicolinc@nvidia.com/
 * Rebase on arm/smmu/updates branch + bug fix
 * Update commit messages and inline comments
 * [iommu] Drop unnecessary ops validation
 * [iommu] Add missed function stub when !CONFIG_IOMMU_API
 * [iommu] Change iommu_report_device_broken() to per gdev
 * [iommu] Separate quarantine from pci_dev_reset_prepare()
 * [iommu] Check reset failure in pci_dev_reset_iommu_done()
 * [smmuv3] Fix STE update with try_cmpxchg64()
 * [smmuv3] Fix "continue" bug when skipping ATC commands
 * [smmuv3] Replace atomic_t prod_err with a lockless bitmap
 * [smmuv3] Drop master->invs_domain; disable ATS per-master directly
 * [smmuv3] Return -EIO for ATC timeout vs. -ETIMEDOUT for poll timeout
 * [smmuv3] Replace INV_TYPE_ATS_DISABLED with per-master ats_broken flag
v2:
 https://lore.kernel.org/all/cover.1773774441.git.nicolinc@nvidia.com/
 * Rebase on arm_smmu_invs-v13 series
 * Bisect batched ATC invalidation commands
 * Drop the direct pci_reset_function() call
 * Move the work queue from SMMUv3 to the core
 * Perform a surgical STE update to disable EATS
 * Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
 https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/

Thanks
Nicolin

Nicolin Chen (24):
  PCI: Don't suspend IOMMU when probing reset capability
  PCI: Propagate FLR return values to callers
  iommu: Convert gdev->blocked from bool to enum gdev_blocked
  iommu: Pass in reset result to pci_dev_reset_iommu_done()
  iommu: Add reset_device_done callback for hardware fault recovery
  iommu: Defer iommu_group free via kfree_rcu()
  iommu: Defer __iommu_group_free_device() to be outside group->mutex
  iommu: Change group->devices to RCU-protected list
  iommu: Add group pointer to struct group_device
  iommu: Add __iommu_group_block_device helper
  iommu: Add iommu_report_device_broken() to quarantine a broken device
  iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM
  iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback
  iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out
  iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space()
    fails
  iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries
  iommu/arm-smmu-v3: Introduce master->ats_broken flag
  iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch
  iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper
  iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header
  iommu/arm-smmu-v3: Introduce master->ats_invs
  iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates
  iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  72 +++-
 include/linux/iommu.h                         |  18 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 387 ++++++++++++++---
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    |  36 +-
 drivers/iommu/iommu.c                         | 406 ++++++++++++++----
 drivers/pci/pci-acpi.c                        |   2 +-
 drivers/pci/pci.c                             |  21 +-
 drivers/pci/quirks.c                          |  43 +-
 8 files changed, 820 insertions(+), 165 deletions(-)

-- 
2.43.0
RE: [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Tian, Kevin 4 days, 22 hours ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, May 19, 2026 11:39 AM
>
> Changelog
> v4:
>  * Rebase on Joerg's IOMMU "fixes" branch
>  * Rebase on Jason's SMMUv3 cmd_ent series
>    https://lore.kernel.org/all/0-v2-47b2bf710ad5+716ac-smmu_no_cmdq_ent_jgg@nvidia.com/
>  * [PCI] Don't suspend IOMMU in probe mode
>  * [iommu] kfree_rcu() iommu_group
>  * [iommu] Convert gdev->blocked to enum gdev_blocked
>  * [iommu] Use disable_work_sync() to fix UAF and ref leak
>  * [iommu] Gate done() transitions to preserve BLOCKED_BROKEN
>  * [iommu] Decrement recovery_cnt when unplugging a blocked gdev
>  * [iommu] Drop racy dev_has_iommu() in iommu_report_device_broken()
>  * [iommu] Add gdev->broken_pending to skip worker after racing recovery
>  * [smmuv3] Add master->ats_invs scratch
>  * [smmuv3] Add arm_smmu_cmdq_batch_issue() wrapper
>  * [smmuv3] Force per-flush sync for has_ats batches
>  * [smmuv3] Serialize STE.EATS and ats_broken updates
>  * [smmuv3] Co-clear pending CMDQ_ERR from cmdq issuer
>  * [smmuv3] Add invs and has_ats to arm_smmu_cmdq_batch
>  * [smmuv3] Move arm_smmu_invs_for_each_entry to header
>  * [smmuv3] Set master->ats_broken after clearing STE.EATS
>  * [smmuv3] Issue CFGI_STE via arm_smmu_cmdq_issue_cmd_with_sync()
>  * [smmuv3] Keep "smmu" pointer in arm_smmu_inv but add "master" for ATS

I haven't checked the details yet, but this v4 more than doubles the number of
patches in v3. Do they all need to be in one series? Is there any chance to
split it to ease the review?
Re: [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Nicolin Chen 4 days, 20 hours ago
On Wed, May 20, 2026 at 03:59:43AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, May 19, 2026 11:39 AM
> >
> > Changelog
> > v4:
> >  * Rebase on Joerg's IOMMU "fixes" branch
> >  * Rebase on Jason's SMMUv3 cmd_ent series
> >    https://lore.kernel.org/all/0-v2-47b2bf710ad5+716ac-smmu_no_cmdq_ent_jgg@nvidia.com/
> >  * [PCI] Don't suspend IOMMU in probe mode
> >  * [iommu] kfree_rcu() iommu_group
> >  * [iommu] Convert gdev->blocked to enum gdev_blocked
> >  * [iommu] Use disable_work_sync() to fix UAF and ref leak
> >  * [iommu] Gate done() transitions to preserve BLOCKED_BROKEN
> >  * [iommu] Decrement recovery_cnt when unplugging a blocked gdev
> >  * [iommu] Drop racy dev_has_iommu() in iommu_report_device_broken()
> >  * [iommu] Add gdev->broken_pending to skip worker after racing recovery
> >  * [smmuv3] Add master->ats_invs scratch
> >  * [smmuv3] Add arm_smmu_cmdq_batch_issue() wrapper
> >  * [smmuv3] Force per-flush sync for has_ats batches
> >  * [smmuv3] Serialize STE.EATS and ats_broken updates
> >  * [smmuv3] Co-clear pending CMDQ_ERR from cmdq issuer
> >  * [smmuv3] Add invs and has_ats to arm_smmu_cmdq_batch
> >  * [smmuv3] Move arm_smmu_invs_for_each_entry to header
> >  * [smmuv3] Set master->ats_broken after clearing STE.EATS
> >  * [smmuv3] Issue CFGI_STE via arm_smmu_cmdq_issue_cmd_with_sync()
> >  * [smmuv3] Keep "smmu" pointer in arm_smmu_inv but add "master" for ATS
> 
> I haven't checked the details yet, but this v4 more than doubles the number of
> patches in v3. Do they all need to be in one series? Is there any chance to
> split it to ease the review?

Ah, sorry. Most of these deal with races and corner cases in
the async design...

Jason has suggested dropping quite a few things, so I will respin a
v5 that should be much smaller. Let's ignore this v4 for now.

Thanks
Nicolin