From nobody Thu Apr 9 23:22:37 2026
From: Nicolin Chen
Subject: [PATCH v1 1/2] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
Date: Wed, 4 Mar 2026 21:21:41 -0800
Message-ID: <58e6266a89ad7855ef0658b2a2bb1f4ee4119e23.1772686998.git.nicolinc@nvidia.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
errors (e.g., ATC invalidation request timeout), which are typically sent
to the driver's ISR. To recover from such errors, the driver would need to
initiate a device reset procedure (I/O waiting) in an asynchronous thread.

If somehow the reset procedure fails, the ATC will be out of sync with the
OS, since the memory is already unmapped and could even be re-assigned. In
this case, the device must be kept in the resetting domain, to prevent any
memory corruption.

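As a rough illustration of the rule stated above, the following userspace C
sketch models a device that is only unblocked after a successful reset. It is
a simplified model under stated assumptions, not kernel code: every name in it
(model_dev, model_prepare, model_done, model_reset_function, model_is_safe) is
hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dev {
	bool blocked;   /* models the device sitting in the resetting domain */
	bool atc_stale; /* models stale translations left in the device ATC */
};

/* Models the "prepare" step: block RID/ATS. Calling it twice is harmless. */
static int model_prepare(struct model_dev *d)
{
	d->blocked = true;
	return 0;
}

/* Models the "done" step: unblock RID/ATS. */
static void model_done(struct model_dev *d)
{
	d->blocked = false;
}

/*
 * Models one reset attempt. reset_ok stands in for the outcome of the
 * hardware reset; only a successful reset clears the stale ATC state,
 * and only then is the device unblocked.
 */
static int model_reset_function(struct model_dev *d, bool reset_ok)
{
	int ret = model_prepare(d);

	if (ret)
		return ret;
	if (reset_ok)
		d->atc_stale = false;
	ret = reset_ok ? 0 : -1;
	if (!ret)
		model_done(d);
	return ret;
}

/* Invariant to protect: a device with a stale ATC is never unblocked. */
static bool model_is_safe(const struct model_dev *d)
{
	return d->blocked || !d->atc_stale;
}
```

Calling model_reset_function() with a failing reset leaves the model blocked,
and a later successful attempt unblocks it; model_is_safe() holds in both
states.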

Yet, currently pci_dev_reset_iommu_done() is called unconditionally:
    IOMMU recovery thread():
      pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        __reset(); // Failed (ATC is still stale)
        pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)

The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
reset:
    IOMMU recovery thread():
      pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        if (!__reset())
          pci_dev_reset_iommu_done(); // Unblock RID/ATS
        else
          // keep the device blocked by IOMMU

However, this breaks the symmetric requirement of these reset APIs so that
we have to allow a re-entry to pass a second reset attempt:
    IOMMU recovery thread():
      pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        __reset(); // Failed (ATC is still stale)
        // Keep the device blocked by IOMMU
    ...
    Another thread():
      pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Re-entry (!)

Update the function kdocs and all the existing callers to only unblock ATS
when the reset succeeds.

Drop the WARN_ON in pci_dev_reset_iommu_prepare() to allow re-entries.

Signed-off-by: Nicolin Chen
---
 drivers/iommu/iommu.c  | 16 +++++++++-----
 drivers/pci/pci-acpi.c | 11 +++++++++-
 drivers/pci/pci.c      | 50 +++++++++++++++++++++++++++++++++++++-----
 drivers/pci/quirks.c   | 11 +++++++++-
 4 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..40a15c9360bd1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * IOMMU activity while leaving the group->domain pointer intact. Later when the
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
- * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
+ * to unset the resetting_domain. If the reset fails, caller can choose to keep
+ * the device in the resetting_domain to protect system memory using IOMMU from
+ * any bad ATS.
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
-		return -EBUSY;
+	/* Already prepared */
+	if (group->resetting_domain)
+		return 0;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret)
@@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * re-attaching all RID/PASID of the device's back to the domains retained in
  * the core-level structure.
  *
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
+ * to keep the device in the resetting_domain to protect system memory.
  *
  * Note that, although unlikely, there is a risk that re-attaching domains might
  * fail due to some unexpected happening like OOM.
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..f1a918938242c 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
 		ret = -ENOTTY;
 	}
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..80c5cf6eeebdc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
 
 	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 
 	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
 	pci_dev_d3_sleep(dev);
 
 	ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 
 	rc = pci_parent_bus_reset(dev, probe);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
@@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
 				      reg);
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..d9a03a7772916 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -4269,7 +4270,15 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
 	}
 
 	ret = i->reset(dev, probe);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
-- 
2.43.0

From nobody Thu Apr 9 23:22:37 2026
From: Nicolin Chen
Subject: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
Date: Wed, 4 Mar 2026 21:21:42 -0800
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.

When a device wasn't responsive to an ATC invalidation request, this often
results in constant CMDQ errors:
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb84): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb88): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
  ...

An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means that device's internal
ATS state is desynchronized from the SMMU.

Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.

The only safe recovery action is to issue a PCI reset, which guarantees to
flush all internal device caches and recover the device.

Read the ATC_INV command that led to the timeouts, and schedule a recovery
worker to reset the device corresponding to the Stream ID. If reset fails,
keep the device in the resetting/blocking domain to avoid data corruption.

Though it'd be ideal to block it immediately in the ISR, it cannot be done
because an STE update would require another CFGI_STE command that couldn't
finish in the context of an ISR handling a CMDQ error.

Signed-off-by: Nicolin Chen
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   5 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 131 +++++++++++++++++++-
 2 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 3c6d65d36164f..8789cf8294504 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -803,6 +803,11 @@ struct arm_smmu_device {
 
 	struct rb_root streams;
 	struct mutex streams_mutex;
+
+	struct {
+		struct list_head list;
+		spinlock_t lock; /* Lock the list */
+	} atc_recovery;
 };
 
 struct arm_smmu_stream {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4d00d796f0783..de182c27c77c4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -106,6 +106,8 @@ static const char * const event_class_str[] = {
 	[3] = "Reserved",
 };
 
+static struct arm_smmu_master *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid);
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
@@ -174,6 +176,13 @@ static void queue_inc_cons(struct arm_smmu_ll_queue *q)
 	q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
 }
 
+static u32 queue_prev_cons(struct arm_smmu_ll_queue *q, u32 cons)
+{
+	u32 idx_wrp = (Q_WRP(q, cons) | Q_IDX(q, cons)) - 1;
+
+	return Q_OVF(cons) | Q_WRP(q, idx_wrp) | Q_IDX(q, idx_wrp);
+}
+
 static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
 {
 	struct arm_smmu_ll_queue *llq = &q->llq;
@@ -410,6 +419,97 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 	u64p_replace_bits(cmd, CMDQ_SYNC_0_CS_NONE, CMDQ_SYNC_0_CS);
 }
 
+/* ATC recovery upon ATC invalidation timeout */
+struct arm_smmu_atc_recovery_param {
+	struct arm_smmu_device *smmu;
+	struct pci_dev *pdev;
+	u32 sid;
+
+	struct work_struct work;
+	struct list_head node;
+};
+
+static void arm_smmu_atc_recovery_worker(struct work_struct *work)
+{
+	struct arm_smmu_atc_recovery_param *param =
+		container_of(work, struct arm_smmu_atc_recovery_param, work);
+	struct pci_dev *pdev;
+
+	scoped_guard(mutex, &param->smmu->streams_mutex) {
+		struct arm_smmu_master *master;
+
+		master = arm_smmu_find_master(param->smmu, param->sid);
+		if (!master || WARN_ON(!dev_is_pci(master->dev)))
+			goto free_param;
+		pdev = to_pci_dev(master->dev);
+		pci_dev_get(pdev);
+	}
+
+	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
+		struct arm_smmu_atc_recovery_param *e;
+
+		list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
+			/* Device is already being recovered */
+			if (e->pdev == pdev)
+				goto put_pdev;
+		}
+		param->pdev = pdev;
+		list_add(&param->node, &param->smmu->atc_recovery.list);
+	}
+
+	/*
+	 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
+	 * corruption. This must take pci_dev_lock to prevent any racy unplug.
+	 *
+	 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
+	 * it again internally.
+	 */
+	pci_dev_lock(pdev);
+	pci_clear_master(pdev);
+	if (pci_dev_reset_iommu_prepare(pdev))
+		pci_err(pdev, "failed to block ATS!\n");
+	pci_dev_unlock(pdev);
+
+	/*
+	 * ATC timeout indicates the device has stopped responding to coherence
+	 * protocol requests. The only safe recovery is a reset to flush stale
+	 * cached translations. Note that pci_reset_function() internally calls
+	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
+	 * if PCI-level reset fails.
+	 */
+	if (!pci_reset_function(pdev)) {
+		/*
+		 * If reset succeeds, set BME back. Otherwise, fence the system
+		 * from a faulty device, in which case user will have to replug
+		 * the device to invoke pci_set_master().
+		 */
+		pci_dev_lock(pdev);
+		pci_set_master(pdev);
+		pci_dev_unlock(pdev);
+	}
+	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
+		list_del(&param->node);
+put_pdev:
+	pci_dev_put(pdev);
+free_param:
+	kfree(param);
+}
+
+static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
+{
+	struct arm_smmu_atc_recovery_param *param;
+
+	param = kzalloc_obj(*param, GFP_ATOMIC);
+	if (!param)
+		return -ENOMEM;
+	param->smmu = smmu;
+	param->sid = sid;
+
+	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
+	queue_work(system_unbound_wq, &param->work);
+	return 0;
+}
+
 void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 			      struct arm_smmu_cmdq *cmdq)
 {
@@ -441,11 +541,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 	case CMDQ_ERR_CERROR_ATC_INV_IDX:
 		/*
 		 * ATC Invalidation Completion timeout. CONS is still pointing
-		 * at the CMD_SYNC. Attempt to complete other pending commands
-		 * by repeating the CMD_SYNC, though we might well end up back
-		 * here since the ATC invalidation may still be pending.
+		 * at the CMD_SYNC. Rewind it to read the ATC_INV command.
 		 */
-		return;
+		cons = queue_prev_cons(&q->llq, cons);
+		fallthrough;
 	case CMDQ_ERR_CERROR_ILL_IDX:
 	default:
 		break;
@@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 	 * not to touch any of the shadow cmdq state.
 	 */
 	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
+
+	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
+		/*
+		 * Since commands can be issued in batch making it difficult to
+		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
+		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
+		 * can be batched.
+		 */
+		WARN_ON(FIELD_GET(CMDQ_0_OP, cmd[0]) != CMDQ_OP_ATC_INV);
+
+		/*
+		 * If we failed to schedule a recovery worker, we would well end
+		 * up back here since the ATC invalidation may still be pending.
+		 * This gives us another chance to reschedule a recovery worker.
+		 */
+		arm_smmu_sched_atc_recovery(smmu,
+					    FIELD_GET(CMDQ_ATC_0_SID, cmd[0]));
+		return;
+	}
+
+	/* idx == CMDQ_ERR_CERROR_ILL_IDX */
 	dev_err(smmu->dev, "skipping command in error state:\n");
 	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
 		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
@@ -3942,6 +4062,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&smmu->atc_recovery.list);
+	spin_lock_init(&smmu->atc_recovery.lock);
+
 	mutex_init(&smmu->streams_mutex);
 	smmu->streams = RB_ROOT;
 
-- 
2.43.0
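
The index arithmetic in queue_prev_cons() can be checked in isolation. The
sketch below is a userspace model, not the driver's code: struct model_llq and
the M_IDX/M_WRP/M_OVF macros are hypothetical stand-ins, and the bit layout
they use (index in the low max_n_shift bits, wrap flag at bit max_n_shift,
overflow flag at bit 31) is an assumption made for the example rather than
something stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a queue descriptor; only the size shift is kept. */
struct model_llq {
	uint32_t max_n_shift; /* queue holds 1 << max_n_shift entries */
};

/* Assumed layout: index bits, then one wrap bit, overflow flag at bit 31. */
#define M_IDX(q, p) ((p) & ((1U << (q)->max_n_shift) - 1))
#define M_WRP(q, p) ((p) & (1U << (q)->max_n_shift))
#define M_OVF(p)    ((p) & (1U << 31))

/*
 * Step the consumer pointer back by one entry. Subtracting 1 from the
 * combined wrap+index value borrows out of the index into the wrap bit,
 * so stepping back from index 0 lands on the last entry with the wrap
 * bit toggled. The overflow flag is carried over unchanged.
 */
static uint32_t model_prev_cons(const struct model_llq *q, uint32_t cons)
{
	uint32_t idx_wrp = (M_WRP(q, cons) | M_IDX(q, cons)) - 1;

	return M_OVF(cons) | M_WRP(q, idx_wrp) | M_IDX(q, idx_wrp);
}
```

With an 8-entry queue (max_n_shift = 3), stepping back from index 5 gives
index 4, and stepping back from index 0 gives index 7 with the wrap bit
flipped.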