Hi all,
This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.
Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb84): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb88): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
...
To resolve this, introduce a mechanism to quarantine a broken device in the
SMMUv3 driver and the IOMMU core. This requires some preparatory changes:
- Tighten the semantics of pci_dev_reset_iommu_done(), which is now called
only upon a successful hardware reset
- Introduce a reset_device_done op, allowing the core to signal the driver
when the physical hardware has been cleanly recovered (e.g., via AER or
a manual reset) so the quarantine can be lifted
- Utilize a per-iommu_group WQ via an iommu_report_device_broken() helper
Note that this implementation only supports single-device iommu_groups.
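The quarantine lifecycle described above can be sketched as a small
userspace model. This is only an illustration, not code from the series:
struct model_group, model_report_device_broken() and model_reset_done()
are hypothetical names, and returning an error for multi-device groups is
an assumption about how the single-device limitation might surface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an iommu_group; not the kernel structure. */
struct model_group {
	unsigned int num_devices;
	bool quarantined;	/* true while the device is treated as broken */
};

/*
 * Models the role of iommu_report_device_broken(): quarantine the device.
 * Assumption: multi-device groups are rejected, since the cover letter
 * says only single-device groups are supported.
 */
static int model_report_device_broken(struct model_group *group)
{
	assert(group);
	if (group->num_devices != 1)
		return -1;
	group->quarantined = true;
	return 0;
}

/*
 * Models the reset-done path: the quarantine is lifted only when the
 * hardware reset actually succeeded. A failed reset leaves it in place.
 */
static void model_reset_done(struct model_group *group, bool reset_succeeded)
{
	assert(group);
	if (reset_succeeded)
		group->quarantined = false;
}
```

In this model a failed reset never reaches the point where the quarantine
is lifted, which mirrors the tightened pci_dev_reset_iommu_done() semantics.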
On the SMMUv3 driver side, introduce the bisection logic to identify which
device caused a batched ATC_INV timeout via an atc_sync_timeouts tracker.
Perform a surgical STE update and flag the ATS as broken to reject further
ATS/ATC requests at the hardware level and suppress further timeout spam.
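The bisection idea can be sketched in userspace C as follows. This is a
generic recursive bisection under stated assumptions, not the logic in the
patches: issue_batch_times_out() is a hypothetical stand-in for issuing
ATC_INV commands for a range of devices followed by a sync, and the
device_unresponsive[] table simulates the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_DEVICES 8

/* Simulated hardware state: true means the device ignores ATC_INV. */
static bool device_unresponsive[NUM_DEVICES];

/*
 * Hypothetical stand-in for "issue ATC_INV for devices [lo, hi) and sync".
 * Returns true if the sync would time out, i.e. if any device in the
 * range is unresponsive.
 */
static bool issue_batch_times_out(size_t lo, size_t hi)
{
	for (size_t i = lo; i < hi; i++)
		if (device_unresponsive[i])
			return true;
	return false;
}

/*
 * Re-issue the batch in halves until every device responsible for a
 * timeout is isolated, then mark it in broken[].
 */
static void bisect_batch(size_t lo, size_t hi, bool *broken)
{
	assert(lo <= hi && hi <= NUM_DEVICES);

	/* Empty range, or the whole range completes: nothing to mark. */
	if (lo == hi || !issue_batch_times_out(lo, hi))
		return;

	/* A single-device range that still times out is the offender. */
	if (hi - lo == 1) {
		broken[lo] = true;
		return;
	}

	size_t mid = lo + (hi - lo) / 2;

	bisect_batch(lo, mid, broken);
	bisect_batch(mid, hi, broken);
}
```

Calling bisect_batch(0, NUM_DEVICES, broken) after a batched timeout marks
only the unresponsive entries, so responsive devices in the same batch are
left alone.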
This is on GitHub:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v2
Changelog
v2:
* Rebase on arm_smmu_invs-v13 series [0]
* Bisect batched ATC invalidation commands
* Drop the direct pci_reset_function() call
* Move the work queue from SMMUv3 to the core
* Perform a surgical STE update to disable EATS
* Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/
[0] https://lore.kernel.org/all/cover.1773733797.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (7):
iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
iommu: Add reset_device_done callback for hardware fault recovery
iommu: Add iommu_report_device_broken() to quarantine a broken device
iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
iommu/arm-smmu-v3: Introduce master->ats_broken flag
iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 +-
include/linux/iommu.h | 4 +
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 34 ++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 ++++++++++++++++--
drivers/iommu/iommu.c | 87 ++++++++-
drivers/pci/pci-acpi.c | 11 +-
drivers/pci/pci.c | 50 +++++-
drivers/pci/quirks.c | 11 +-
8 files changed, 322 insertions(+), 44 deletions(-)
--
2.43.0
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
>
> Hi all,
>
> This series addresses a critical vulnerability and stability issue where an
> unresponsive PCIe device failing to process ATC (Address Translation Cache)
> invalidation requests leads to silent data corruption and continuous SMMU
> CMDQ error spam.
>

None of the patches in this series contains a Fixed tag and cc stable.
On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> >
> > Hi all,
> >
> > This series addresses a critical vulnerability and stability issue where an
> > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > invalidation requests leads to silent data corruption and continuous SMMU
> > CMDQ error spam.
> >
>
> None of the patches in this series contains a Fixed tag and cc stable.

Hmm, I guess AI overly polished the cover letter so it sounds too
strong?

This is essentially a vulnerability (potential memory corruption).
And none of these patches actually fixes any regression. The PATCH
7 even requires the arm_smmu_invs series which has not been merged
yet :-/

Nicolin
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 4:05 AM
>
> On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, March 18, 2026 3:16 AM
> > >
> > > Hi all,
> > >
> > > This series addresses a critical vulnerability and stability issue where an
> > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > invalidation requests leads to silent data corruption and continuous SMMU
> > > CMDQ error spam.
> > >
> >
> > None of the patches in this series contains a Fixed tag and cc stable.
>
> Hmm, I guess AI overly polished the cover letter so it sounds too
> strong?
>
> This is essentially a vulnerability (potential memory corruption).
> And none of these patches actually fixes any regression. The PATCH
> 7 even requires the arm_smmu_invs series which has not been merged
> yet :-/
>

Fixes tag and backporting are not just for regression. People certainly
want to see reported vulnerabilities fixed in stable kernels...
On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > This series addresses a critical vulnerability and stability issue where an
> > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > CMDQ error spam.
> > > >
> > >
> > > None of the patches in this series contains a Fixed tag and cc stable.
> >
> > Hmm, I guess AI overly polished the cover letter so it sounds too
> > strong?
> >
> > This is essentially a vulnerability (potential memory corruption).
> > And none of these patches actually fixes any regression. The PATCH
> > 7 even requires the arm_smmu_invs series which has not been merged
> > yet :-/
> >
>
> Fixes tag and backporting are not just for regression. People certainly
> want to see reported vulnerabilities fixed in stable kernels...

Well, maybe I'll just leave additional line telling people that this
can't be a bug "fix" because it's written on another unmerged series?

Nicolin
On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > CMDQ error spam.
> > > > >
> > > >
> > > > None of the patches in this series contains a Fixed tag and cc stable.
> > >
> > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > strong?
> > >
> > > This is essentially a vulnerability (potential memory corruption).
> > > And none of these patches actually fixes any regression. The PATCH
> > > 7 even requires the arm_smmu_invs series which has not been merged
> > > yet :-/
> > >
> >
> > Fixes tag and backporting are not just for regression. People certainly
> > want to see reported vulnerabilities fixed in stable kernels...
>
> Well, maybe I'll just leave additional line telling people that this
> can't be a bug "fix" because it's written on another unmerged series?

I think this is more of a feature (RAS support for SMMUv3) than a
specific fix.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 24, 2026 8:03 AM
>
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > >
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > >
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > >
> > >
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> >
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
>
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.
>

Not a RAS guy, but below is what I got from AI:

"
RAS improvements typically involve better error reporting, graceful
degradation, or improved recovery - but they usually don't involve
scenarios where the system continues operating with compromised
security assumptions."
On Wed, Mar 25, 2026 at 06:55:40AM +0000, Tian, Kevin wrote:
> > I think this is more of a feature (RAS support for SMMUv3) than a
> > specific fix.
> >
>
> Not a RAS guy, but below is what I got from AI:
>
> "
> RAS improvements typically involve better error reporting, graceful
> degradation, or improved recovery - but they usually don't involve
> scenarios where the system continues operating with compromised
> security assumptions."

Right, so currently there is no RAS in smmuv3, if it hits this error
it continues with "compromised security assumptions". Adding RAS
support is to avoid this.

Jason
On Mon, Mar 23, 2026 at 09:03:21PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > >
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > >
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > >
> > >
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> >
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
>
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.

Adding that to the cover-letter. Thanks for the input.

Nicolin