[PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout

Nicolin Chen posted 7 patches 2 weeks, 6 days ago
[PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Nicolin Chen 2 weeks, 6 days ago
Hi all,

This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.

Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
    ...

To resolve this, introduce a mechanism to quarantine a broken device in the
SMMUv3 driver and the IOMMU core. To achieve this, some preparatory changes:
 - Tighten the semantics of pci_dev_reset_iommu_done() so that it is called
   only upon a successful hardware reset
 - Introduce a reset_device_done op, allowing the core to signal the driver
   when the physical hardware has been cleanly recovered (e.g., via AER or
   a manual reset) so the quarantine can be lifted
 - Utilize a per-iommu_group WQ via an iommu_report_device_broken() helper.
   Note that this implementation only supports single-device iommu_groups.
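
As a rough illustration only (this is not code from the series), the
quarantine lifecycle described by the list above can be modelled in
user-space C. Every name below (mock_dev, mock_report_device_broken(),
mock_reset_done()) is hypothetical, and the reset_ok argument is an
assumption standing in for "the done notification only arrives after a
successful hardware reset":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state; not a structure from the kernel. */
struct mock_dev {
	bool quarantined;	/* device was reported broken */
	bool ats_blocked;	/* ATS/ATC requests are rejected */
};

/* Hypothetical analogue of reporting a broken device to the core. */
static void mock_report_device_broken(struct mock_dev *dev)
{
	assert(dev != NULL);
	dev->quarantined = true;
	dev->ats_blocked = true;
}

/*
 * Hypothetical analogue of the reset-done notification. Assumption: a
 * failed reset must never lift the quarantine, so it is a no-op here.
 */
static void mock_reset_done(struct mock_dev *dev, bool reset_ok)
{
	assert(dev != NULL);
	if (!reset_ok)
		return;
	dev->quarantined = false;
	dev->ats_blocked = false;
}
```

In this model a device reported broken stays quarantined across any
number of failed resets, and only a successful one clears both flags.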

On the SMMUv3 driver side, introduce the bisection logic to identify which
device caused a batched ATC_INV timeout via an atc_sync_timeouts tracker.
Perform a surgical STE update and flag the ATS as broken to reject further
ATS/ATC requests at the hardware level and suppress further timeout spam.
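
To make the bisection idea concrete, here is a minimal user-space sketch.
It is not the driver's implementation: atc_inv_cmd, submit_fn,
bisect_state, bisect_range() and fake_submit() are all hypothetical names,
and it assumes that re-issuing an ATC invalidation is harmless and that a
"submit plus sync" hook reports whether the sync timed out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one ATC_INV command aimed at one device. */
struct atc_inv_cmd {
	unsigned int sid;	/* illustrative StreamID of the target */
};

/* Hypothetical hook: issue cmds[0..n) plus a sync; false means timeout. */
typedef bool (*submit_fn)(const struct atc_inv_cmd *cmds, size_t n, void *ctx);

struct bisect_state {
	submit_fn submit;
	void *ctx;
	unsigned int *bad_sids;	/* output: StreamIDs that timed out */
	size_t nbad;
	size_t max_bad;
};

/* Re-issue a range; on timeout, split it until single commands remain. */
static void bisect_range(struct bisect_state *st,
			 const struct atc_inv_cmd *cmds, size_t n)
{
	assert(st != NULL && st->submit != NULL);
	if (n == 0 || st->submit(cmds, n, st->ctx))
		return;		/* nothing to do, or this range synced fine */
	if (n == 1) {
		/* A lone command that times out identifies the device. */
		if (st->nbad < st->max_bad)
			st->bad_sids[st->nbad++] = cmds[0].sid;
		return;
	}
	bisect_range(st, cmds, n / 2);
	bisect_range(st, cmds + n / 2, n - n / 2);
}

/* Test double: any range containing the broken StreamID "times out". */
static bool fake_submit(const struct atc_inv_cmd *cmds, size_t n, void *ctx)
{
	unsigned int broken_sid = *(unsigned int *)ctx;

	for (size_t i = 0; i < n; i++)
		if (cmds[i].sid == broken_sid)
			return false;
	return true;
}
```

A device that appears in several commands of the batch would be recorded
once per command here; a real tracker would need to deduplicate.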

This is on GitHub:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v2

Changelog
v2:
 * Rebase on arm_smmu_invs-v13 series [0]
 * Bisect batched ATC invalidation commands
 * Drop the direct pci_reset_function() call
 * Move the work queue from SMMUv3 to the core
 * Perform a surgical STE update to disable EATS
 * Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
 https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/

[0] https://lore.kernel.org/all/cover.1773733797.git.nicolinc@nvidia.com/

Thanks
Nicolin

Nicolin Chen (7):
  iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  iommu: Add reset_device_done callback for hardware fault recovery
  iommu: Add iommu_report_device_broken() to quarantine a broken device
  iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
  iommu/arm-smmu-v3: Introduce master->ats_broken flag
  iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   4 +-
 include/linux/iommu.h                         |   4 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  |  34 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 165 ++++++++++++++++--
 drivers/iommu/iommu.c                         |  87 ++++++++-
 drivers/pci/pci-acpi.c                        |  11 +-
 drivers/pci/pci.c                             |  50 +++++-
 drivers/pci/quirks.c                          |  11 +-
 8 files changed, 322 insertions(+), 44 deletions(-)

-- 
2.43.0
RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Tian, Kevin 2 weeks, 5 days ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> Hi all,
> 
> This series addresses a critical vulnerability and stability issue where an
> unresponsive PCIe device failing to process ATC (Address Translation Cache)
> invalidation requests leads to silent data corruption and continuous SMMU
> CMDQ error spam.
> 

None of the patches in this series contains a Fixes tag or Cc: stable.
Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Nicolin Chen 2 weeks, 4 days ago
On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > Hi all,
> > 
> > This series addresses a critical vulnerability and stability issue where an
> > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > invalidation requests leads to silent data corruption and continuous SMMU
> > CMDQ error spam.
> > 
> 
> None of the patches in this series contains a Fixed tag and cc stable.

Hmm, I guess AI overly polished the cover letter so it sounds too
strong?

This is essentially a vulnerability (potential memory corruption).
And none of these patches actually fixes any regression. The PATCH
7 even requires the arm_smmu_invs series which has not been merged
yet :-/

Nicolin
RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Tian, Kevin 2 weeks, 4 days ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 4:05 AM
> 
> On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, March 18, 2026 3:16 AM
> > >
> > > Hi all,
> > >
> > > This series addresses a critical vulnerability and stability issue where an
> > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > CMDQ error spam.
> > >
> >
> > None of the patches in this series contains a Fixed tag and cc stable.
> 
> Hmm, I guess AI overly polished the cover letter so it sounds too
> strong?
> 
> This is essentially a vulnerability (potential memory corruption).
> And none of these patches actually fixes any regression. The PATCH
> 7 even requires the arm_smmu_invs series which has not been merged
> yet :-/
> 

Fixes tag and backporting are not just for regression. People certainly
want to see reported vulnerabilities fixed in stable kernels...
Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Nicolin Chen 2 weeks, 4 days ago
On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > This series addresses a critical vulnerability and stability issue where an
> > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > CMDQ error spam.
> > > >
> > >
> > > None of the patches in this series contains a Fixed tag and cc stable.
> > 
> > Hmm, I guess AI overly polished the cover letter so it sounds too
> > strong?
> > 
> > This is essentially a vulnerability (potential memory corruption).
> > And none of these patches actually fixes any regression. The PATCH
> > 7 even requires the arm_smmu_invs series which has not been merged
> > yet :-/
> > 
> 
> Fixes tag and backporting are not just for regression. People certainly
> want to see reported vulnerabilities fixed in stable kernels...

Well, maybe I'll just add a line telling people that this can't be
a bug "fix" because it's written on top of another unmerged series?

Nicolin
Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Jason Gunthorpe 1 week, 6 days ago
On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > CMDQ error spam.
> > > > >
> > > >
> > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > 
> > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > strong?
> > > 
> > > This is essentially a vulnerability (potential memory corruption).
> > > And none of these patches actually fixes any regression. The PATCH
> > > 7 even requires the arm_smmu_invs series which has not been merged
> > > yet :-/
> > > 
> > 
> > Fixes tag and backporting are not just for regression. People certainly
> > want to see reported vulnerabilities fixed in stable kernels...
> 
> Well, maybe I'll just leave additional line telling people that this
> can't be a bug "fix" because it's written on another unmerged series?

I think this is more of a feature (RAS support for SMMUv3) than a
specific fix.

Jason
RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Tian, Kevin 1 week, 5 days ago
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 24, 2026 8:03 AM
> 
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > >
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > >
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > >
> > >
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> >
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
> 
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.
> 

Not a RAS guy, but below is what I got from AI:

"
RAS improvements typically involve better error reporting, graceful
degradation, or improved recovery - but they usually don't involve
scenarios where the system continues operating with compromised
security assumptions."
Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Jason Gunthorpe 1 week, 5 days ago
On Wed, Mar 25, 2026 at 06:55:40AM +0000, Tian, Kevin wrote:
> > I think this is more of a feature (RAS support for SMMUv3) than a
> > specific fix.
> > 
> 
> Not a RAS guy, but below is what I got from AI:
> 
> "
> RAS improvements typically involve better error reporting, graceful
> degradation, or improved recovery - but they usually don't involve
> scenarios where the system continues operating with compromised
> security assumptions."

Right, so currently there is no RAS in smmuv3, if it hits this error
it continues with "compromised security assumptions". Adding RAS
support is to avoid this.

Jason
Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
Posted by Nicolin Chen 1 week, 6 days ago
On Mon, Mar 23, 2026 at 09:03:21PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > > 
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > > 
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > > 
> > > 
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> > 
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
> 
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.

Adding that to the cover-letter. Thanks for the input.

Nicolin