From: Bjoern Doebel
CC: Bjoern Doebel, Marc Zyngier, Thomas Gleixner, David Woodhouse,
 Ali Saidi, David Arinzon, Zeev Zilberman
Subject: [PATCH] irqchip/gic-v3-its: Reconfigure ITS from software state on resume
Date: Thu, 7 May 2026 18:31:00 +0000
Message-ID: <20260507183102.1897629-1-doebel@amazon.de>
X-Mailer: git-send-email 2.50.1
X-Mailing-List: linux-kernel@vger.kernel.org

After resume, MSI-X interrupts can be silently dropped because the ITS
hardware state does not match the software its_device state. The
in-memory tables pointed at by GITS_BASER survive suspend, but the ITS
has been reset and must be reconfigured via ITS commands, per the GICv3
ITS architecture specification §5.6.1 (Enabling an ITS). Some ITS
implementations also keep internal state that is only populated via ITS
commands rather than by reading guest/firmware memory on demand, so
restoring GITS_BASER alone is not enough.

Before commit 713335b6ee29 ("irqchip/gic-v3-its: Implement
.msi_teardown() callback"), pci_free_irq_vectors() tore down the
its_device (MAPD valid=0, ITT freed) and pci_alloc_irq_vectors()
rebuilt it (MAPD valid=1). Drivers that disabled/re-enabled MSI-X
across suspend/resume (e.g. ENA, NVMe) thus reprogrammed the ITS as a
side effect.

After commit 713335b6ee29 ("irqchip/gic-v3-its: Implement
.msi_teardown() callback"), device teardown moved to .msi_teardown(),
which only runs when the MSI domain is removed (driver unbind). Since
the MSI domain persists across suspend/resume, MAPD is never replayed.

Fix this in its_restore_enable() and its_cpu_init_collection() by
walking the preserved software state and re-issuing the ITS commands
needed to bring the hardware back in sync:

 1. For each device, issue MAPD(V=0), zero the ITT, then MAPD(V=1)
    with the same parameters. §5.3.10 makes MAPD(V=1) with a non-zero
    ITT UNPREDICTABLE, so the ITT must be zeroed first.

 2. Restore every CPU's collection (MAPC) and replay MAPTI for events
    targeting that CPU, once all target collections have been mapped.
    The per-event replay is driven by a bool parameter to
    its_cpu_init_collection() so that every ITS on a given CPU gets its
    MAPTIs restored. For the boot CPU, which does not traverse
    its_cpu_init_collections() on resume, replay is driven directly
    from its_restore_enable() for every ITS; the HCC optimisation that
    previously skipped MAPC for memory-resident collections on the boot
    CPU is dropped, matching what secondary CPUs already do in their
    normal cpuhp startup path. For secondary CPUs, replay is gated by a
    cpumask armed by its_restore_enable() once per resume cycle so that
    normal CPU hotplug is unaffected.

GICv4 vLPI state is skipped here: vLPIs are hypervisor-only, replayed
through separate GICv4 VM resume paths, and not relevant to guest
kernels or to this fix.

Tested on EC2 c6gn.16xlarge (ARM64 Graviton). Without the fix,
hibernation resume fails 100% with:

  ena 0000:00:05.0: The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd 3)
  ena 0000:00:05.0: Failed to create IO CQ. error: -62

With the fix, hibernation resume works reliably.

Fixes: 713335b6ee29 ("irqchip/gic-v3-its: Implement .msi_teardown() callback")
Cc: stable@vger.kernel.org
Cc: Marc Zyngier
Cc: Thomas Gleixner
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: David Woodhouse
Cc: Ali Saidi
Co-developed-by: David Arinzon
Signed-off-by: David Arinzon
Co-developed-by: Zeev Zilberman
Signed-off-by: Zeev Zilberman
Signed-off-by: Bjoern Doebel
Assisted-by: Kiro:claude-opus-4.6
---
Testing:

Tested hibernation using Amazon Linux 2023 and kernel 7.1-rc2 on EC2
c6gn, c7gn, and c8gn instances. Without the patch, hibernation failed
to bring up the ENA network device. With the patch, ENA devices are
properly re-initialized on resume.
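
The once-per-resume gating for secondary CPUs described in the commit
message can be illustrated with a minimal userspace sketch. It assumes
C11 atomics on a plain bitmask in place of the kernel cpumask helpers;
every sketch_ name and the 8-CPU limit are hypothetical and not part of
the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8	/* hypothetical CPU count for this sketch */

/* Stand-in for its_restore_pending_cpus: one bit per CPU. */
static atomic_uint sketch_pending_cpus;

/*
 * Resume path on the boot CPU: arm replay for every CPU except the
 * caller, which handles its own replay directly.
 */
static void sketch_arm_replay(unsigned int boot_cpu)
{
	unsigned int all = (1u << SKETCH_NR_CPUS) - 1u;

	assert(boot_cpu < SKETCH_NR_CPUS);
	atomic_store(&sketch_pending_cpus, all & ~(1u << boot_cpu));
}

/*
 * CPU bring-up path: atomically clear this CPU's bit and report
 * whether it was set, so replay happens at most once per arming.
 */
static bool sketch_test_and_clear(unsigned int cpu)
{
	unsigned int bit = 1u << cpu;

	assert(cpu < SKETCH_NR_CPUS);
	return (atomic_fetch_and(&sketch_pending_cpus, ~bit) & bit) != 0;
}
```

After sketch_arm_replay(0), the first sketch_test_and_clear(3) returns
true and the second returns false, which is the property that keeps an
ordinary hotplug of the same CPU later on from replaying anything.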
---
 drivers/irqchip/irq-gic-v3-its.c | 124 ++++++++++++++++++++++++++++---
 1 file changed, 114 insertions(+), 10 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 291d7668cc8da..0d240230037cd 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3283,7 +3283,66 @@ static void its_cpu_init_lpis(void)
 		&paddr);
 }
 
-static void its_cpu_init_collection(struct its_node *its)
+static cpumask_var_t its_restore_pending_cpus;
+
+static void its_restore_device(struct its_device *its_dev)
+{
+	/*
+	 * Bring each device back to a quiescent mapping state, as required
+	 * by GICv3 ITS architecture §5.6.1 (Enabling an ITS) after an ITS
+	 * reset: the device table entries are gone, so software must
+	 * reconfigure them with ITS commands. MAPD(V=1) with a non-zero
+	 * ITT is UNPREDICTABLE (§5.3.10, §5.2.4), so unmap first, zero the
+	 * ITT, and map again with a clean ITT. MAPTI replay is deferred to
+	 * its_cpu_init_collection() so that the collection a given event
+	 * targets is MAPC'd before MAPTI is issued for it.
+	 */
+	its_send_mapd(its_dev, 0);
+	memset(its_dev->itt, 0, its_dev->itt_sz);
+	gic_flush_dcache_to_poc(its_dev->itt, its_dev->itt_sz);
+	its_send_mapd(its_dev, 1);
+}
+
+static void its_cpu_replay_mapti(struct its_node *its)
+{
+	int cpu = smp_processor_id();
+	struct its_device *its_dev;
+	int event;
+
+	/*
+	 * Walk its_device_list without holding its->dev_alloc_lock.
+	 * Device add/remove normally requires that mutex, but this
+	 * function only runs on the resume path, from
+	 * its_cpu_init_collection() on either the boot CPU (called
+	 * directly from its_restore_enable() under its_lock) or a
+	 * secondary CPU (called from its_cpu_init_collections() under
+	 * its_lock). Concurrency with driver MSI alloc/free is excluded
+	 * by the hibernate sequence:
+	 *
+	 *   syscore_resume() <- its_restore_enable() runs here
+	 *   pm_sleep_enable_secondary_cpus() <- its_cpu_init() on each CPU
+	 *   dpm_resume_start() / dpm_resume() <- driver .resume callbacks
+	 *
+	 * See kernel/power/hibernate.c:resume_target_kernel(). Drivers
+	 * cannot add or remove MSI allocations until their .resume
+	 * callbacks run, which is strictly after every CPU has passed
+	 * through its_cpu_init_collection().
+	 */
+	list_for_each_entry(its_dev, &its->its_device_list, entry) {
+		if (its_dev->event_map.vm)
+			continue;
+		for_each_set_bit(event, its_dev->event_map.lpi_map,
+				 its_dev->event_map.nr_lpis) {
+			if (its_dev->event_map.col_map[event] != cpu)
+				continue;
+			its_send_mapti(its_dev,
+				       its_dev->event_map.lpi_base + event,
+				       event);
+		}
+	}
+}
+
+static void its_cpu_init_collection(struct its_node *its, bool replay)
 {
 	int cpu = smp_processor_id();
 	u64 target;
@@ -3320,17 +3379,33 @@ static void its_cpu_init_collection(struct its_node *its)
 
 	its_send_mapc(its, &its->collections[cpu], 1);
 	its_send_invall(its, &its->collections[cpu]);
+
+	/*
+	 * On resume from hibernation, its_restore_enable() has reprogrammed
+	 * the device table but deferred per-event MAPTI replay until each
+	 * target collection is MAPC'd. Now that the local collection is
+	 * mapped, replay MAPTIs for events targeting this CPU on this ITS.
+	 */
+	if (replay)
+		its_cpu_replay_mapti(its);
 }
 
 static void its_cpu_init_collections(void)
 {
 	struct its_node *its;
+	bool replay;
 
-	raw_spin_lock(&its_lock);
+	/*
+	 * On resume from hibernation, its_restore_enable() arms this cpumask
+	 * for every secondary CPU that still needs MAPTI replay. Test-and-
+	 * clear once per CPU and propagate the flag to every ITS on this CPU.
+	 */
+	replay = cpumask_test_and_clear_cpu(smp_processor_id(),
+					    its_restore_pending_cpus);
 
+	raw_spin_lock(&its_lock);
 	list_for_each_entry(its, &its_nodes, entry)
-		its_cpu_init_collection(its);
-
+		its_cpu_init_collection(its, replay);
 	raw_spin_unlock(&its_lock);
 }
 
@@ -5036,8 +5111,22 @@ static void its_restore_enable(void *data)
 	struct its_node *its;
 	int ret;
 
+	/*
+	 * Arm MAPTI replay for every secondary CPU. The boot CPU does not
+	 * go through its_cpu_init_collections() on resume, so it is handled
+	 * directly in the per-ITS loop below; exclude it here to avoid
+	 * leaving a stale bit set.
+	 *
+	 * See §5.6.1 of the GICv3 ITS architecture specification: after an
+	 * ITS reset, software must reconfigure devices, collections and
+	 * translations via ITS commands.
+	 */
+	cpumask_copy(its_restore_pending_cpus, cpu_possible_mask);
+	cpumask_clear_cpu(smp_processor_id(), its_restore_pending_cpus);
+
 	raw_spin_lock(&its_lock);
 	list_for_each_entry(its, &its_nodes, entry) {
+		struct its_device *its_dev;
 		void __iomem *base;
 		int i;
 
@@ -5080,13 +5169,23 @@ static void its_restore_enable(void *data)
 		writel_relaxed(its->ctlr_save, base + GITS_CTLR);
 
 		/*
-		 * Reinit the collection if it's stored in the ITS. This is
-		 * indicated by the col_id being less than the HCC field.
-		 * CID < HCC as specified in the GIC v3 Documentation.
+		 * Reset and remap each device on this ITS. After resume,
+		 * the ITS has no device table entries and ITT contents may
+		 * be stale; per GICv3 ITS §5.3.10, MAPD(V=1) with a non-zero
+		 * ITT is UNPREDICTABLE. Unmap first, zero the ITT, then map
+		 * again.
 		 */
-		if (its->collections[smp_processor_id()].col_id <
-		    GITS_TYPER_HCC(gic_read_typer(base + GITS_TYPER)))
-			its_cpu_init_collection(its);
+		list_for_each_entry(its_dev, &its->its_device_list, entry)
+			its_restore_device(its_dev);
+
+		/*
+		 * Unconditionally MAPC the boot CPU's collection and replay
+		 * MAPTIs for events targeting it, on every ITS. This mirrors
+		 * the unconditional MAPC that secondary CPUs do in their
+		 * cpuhp startup path, and covers both HW-resident and
+		 * memory-resident collections.
+		 */
+		its_cpu_init_collection(its, true);
 	}
 	raw_spin_unlock(&its_lock);
 }
@@ -5826,6 +5925,11 @@ int __init its_init(struct fwnode_handle *handle, struct rdists *rdists,
 	if (!itt_pool)
 		return -ENOMEM;
 
+	if (!zalloc_cpumask_var(&its_restore_pending_cpus, GFP_KERNEL)) {
+		gen_pool_destroy(itt_pool);
+		return -ENOMEM;
+	}
+
 	gic_rdists = rdists;
 
 	lpi_prop_prio = irq_prio;
-- 
2.48.2
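
The command ordering this patch relies on (unmap, zero the ITT, map
again, then MAPC before any MAPTI for that CPU) can be modelled with a
small userspace sketch. It is only an illustration under stated
assumptions: the command log, the fixed-size event table and all
sketch_ names are hypothetical stand-ins, not the driver's data
structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command tags recorded instead of real ITS commands. */
enum sketch_cmd {
	SKETCH_MAPD_UNMAP,
	SKETCH_ITT_ZERO,
	SKETCH_MAPD_MAP,
	SKETCH_MAPC,
	SKETCH_MAPTI,
};

#define SKETCH_MAX_EVENTS 4
#define SKETCH_LOG_SIZE 32

struct sketch_device {
	unsigned char itt[16];		/* stand-in for the device ITT */
	int mapped[SKETCH_MAX_EVENTS];	/* nonzero if the event is allocated */
	int col_map[SKETCH_MAX_EVENTS];	/* target CPU of each event */
};

static enum sketch_cmd sketch_log[SKETCH_LOG_SIZE];
static size_t sketch_log_len;

static void sketch_emit(enum sketch_cmd cmd)
{
	assert(sketch_log_len < SKETCH_LOG_SIZE);
	sketch_log[sketch_log_len++] = cmd;
}

/* Device step: unmap, clear the ITT, then map with a clean ITT. */
static void sketch_restore_device(struct sketch_device *dev)
{
	sketch_emit(SKETCH_MAPD_UNMAP);
	memset(dev->itt, 0, sizeof(dev->itt));
	sketch_emit(SKETCH_ITT_ZERO);
	sketch_emit(SKETCH_MAPD_MAP);
}

/*
 * Per-CPU step: map the collection first, then replay only the
 * allocated events that target this CPU. Returns the replay count.
 */
static size_t sketch_replay_cpu(const struct sketch_device *dev, int cpu)
{
	size_t replayed = 0;

	sketch_emit(SKETCH_MAPC);
	for (int ev = 0; ev < SKETCH_MAX_EVENTS; ev++) {
		if (!dev->mapped[ev] || dev->col_map[ev] != cpu)
			continue;
		sketch_emit(SKETCH_MAPTI);
		replayed++;
	}
	return replayed;
}
```

Filtering by target CPU in the per-CPU step is what lets each CPU
replay its own events only after its collection has been mapped,
without a second pass over the devices once all CPUs are up.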