Date: Tue, 15 Jul 2025 16:39:06 -0500 (CDT)
From: Timothy Pearson
To: Timothy Pearson
Cc: linuxppc-dev, linux-kernel, linux-pci, Madhavan Srinivasan, Michael Ellerman, christophe leroy, Naveen N Rao, Bjorn Helgaas, Shawn Anastasio
Message-ID: <171044224.1359864.1752615546988.JavaMail.zimbra@raptorengineeringinc.com>
In-Reply-To: <1268570622.1359844.1752615109932.JavaMail.zimbra@raptorengineeringinc.com>
References: <1268570622.1359844.1752615109932.JavaMail.zimbra@raptorengineeringinc.com>
Subject: [PATCH v3 5/6] PCI: pnv_php: Fix surprise plug detection and recovery
X-Mailing-List: linux-kernel@vger.kernel.org

The existing PowerNV hotplug code did not handle surprise plug events
correctly, leading to a complete failure of the hotplug system after
device removal and a required reboot to detect new devices.

This comes down to two issues:

1.) When a device is surprise removed, oftentimes the bridge upstream
    port will cause a PE freeze on the PHB.  If this freeze is not
    cleared, the MSI interrupts from the bridge hotplug notification
    logic will not be received by the kernel, stalling all plug events
    on all slots associated with the PE.

2.) When a device is removed from a slot, regardless of surprise or
    programmatic removal, the associated PHB/PE is left frozen.
    If this freeze is not cleared via a fundamental reset, skiboot
    is unable to clear the freeze and cannot retrain / rescan the
    slot.  This also requires a reboot to clear the freeze and redetect
    the device in the slot.

Issue the appropriate unfreeze and rescan commands on hotplug events,
and don't oops on hotplug if pci_bus_to_OF_node() returns NULL.

Signed-off-by: Timothy Pearson
---
 arch/powerpc/kernel/pci-hotplug.c |   3 +
 drivers/pci/hotplug/pnv_php.c     | 108 +++++++++++++++++++++++++++++-
 2 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 9ea74973d78d..6f444d0822d8 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -141,6 +141,9 @@ void pci_hp_add_devices(struct pci_bus *bus)
 	struct pci_controller *phb;
 	struct device_node *dn = pci_bus_to_OF_node(bus);
 
+	if (!dn)
+		return;
+
 	phb = pci_bus_to_host(bus);
 
 	mode = PCI_PROBE_NORMAL;
diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index bac8af3df41a..3533f7f23b71 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -4,12 +4,14 @@
  *
  * Copyright Gavin Shan, IBM Corporation 2016.
  * Copyright (C) 2025 Raptor Engineering, LLC
+ * Copyright (C) 2025 Raptor Computing Systems, LLC
  */
 
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -469,6 +471,59 @@ static int pnv_php_set_attention_state(struct hotplug_slot *slot, u8 state)
 	return 0;
 }
 
+static int pnv_php_activate_slot(struct pnv_php_slot *php_slot,
+				 struct hotplug_slot *slot)
+{
+	int ret, i;
+
+	/*
+	 * Issue initial slot activation command to firmware
+	 *
+	 * Firmware will power slot on, attempt to train the link, and discover any downstream devices
+	 * If this process fails, firmware will return an error code and an invalid device tree
+	 * Failure can be caused for multiple reasons, including a faulty downstream device,
+	 * poor connection to the downstream device, or a previously latched PHB fence.
+	 * On failure, issue fundamental reset up to three times before aborting.
+	 */
+	ret = pnv_php_set_slot_power_state(slot, OPAL_PCI_SLOT_POWER_ON);
+	if (ret) {
+		SLOT_WARN(
+			php_slot,
+			"PCI slot activation failed with error code %d, possible frozen PHB",
+			ret);
+		SLOT_WARN(
+			php_slot,
+			"Attempting complete PHB reset before retrying slot activation\n");
+		for (i = 0; i < 3; i++) {
+			/*
+			 * Slot activation failed, PHB may be fenced from a
+			 * prior device failure.
+			 *
+			 * Use the OPAL fundamental reset call to both try a
+			 * device reset and clear any potentially active PHB
+			 * fence / freeze.
+			 */
+			SLOT_WARN(php_slot, "Try %d...\n", i + 1);
+			pci_set_pcie_reset_state(php_slot->pdev,
+						 pcie_warm_reset);
+			msleep(250);
+			pci_set_pcie_reset_state(php_slot->pdev,
+						 pcie_deassert_reset);
+
+			ret = pnv_php_set_slot_power_state(
+				slot, OPAL_PCI_SLOT_POWER_ON);
+			if (!ret)
+				break;
+		}
+
+		if (i >= 3)
+			SLOT_WARN(php_slot,
+				  "Failed to bring slot online, aborting!\n");
+	}
+
+	return ret;
+}
+
 static int pnv_php_enable(struct pnv_php_slot *php_slot, bool rescan)
 {
 	struct hotplug_slot *slot = &php_slot->slot;
@@ -531,7 +586,7 @@ static int pnv_php_enable(struct pnv_php_slot *php_slot, bool rescan)
 		goto scan;
 
 	/* Power is off, turn it on and then scan the slot */
-	ret = pnv_php_set_slot_power_state(slot, OPAL_PCI_SLOT_POWER_ON);
+	ret = pnv_php_activate_slot(php_slot, slot);
 	if (ret)
 		return ret;
 
@@ -836,16 +891,63 @@ static int pnv_php_enable_msix(struct pnv_php_slot *php_slot)
 	return entry.vector;
 }
 
+static void
+pnv_php_detect_clear_suprise_removal_freeze(struct pnv_php_slot *php_slot)
+{
+	struct pci_dev *pdev = php_slot->pdev;
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int i, rc;
+
+	/*
+	 * When a device is surprise removed from a downstream bridge slot,
+	 * the upstream bridge port can still end up frozen due to related EEH
+	 * events, which will in turn block the MSI interrupts for slot hotplug
+	 * detection.
+	 *
+	 * Detect and thaw any frozen upstream PE after slot deactivation...
+	 */
+	edev = pci_dev_to_eeh_dev(pdev);
+	pe = edev ? edev->pe : NULL;
+	rc = eeh_pe_get_state(pe);
+	if ((rc == -ENODEV) || (rc == -ENOENT)) {
+		SLOT_WARN(
+			php_slot,
+			"Upstream bridge PE state unknown, hotplug detect may fail\n");
+	} else {
+		if (pe->state & EEH_PE_ISOLATED) {
+			SLOT_WARN(
+				php_slot,
+				"Upstream bridge PE %02x frozen, thawing...\n",
+				pe->addr);
+			for (i = 0; i < 3; i++)
+				if (!eeh_unfreeze_pe(pe))
+					break;
+			if (i >= 3)
+				SLOT_WARN(
+					php_slot,
+					"Unable to thaw PE %02x, hotplug detect will fail!\n",
+					pe->addr);
+			else
+				SLOT_WARN(php_slot,
+					  "PE %02x thawed successfully\n",
+					  pe->addr);
+		}
+	}
+}
+
 static void pnv_php_event_handler(struct work_struct *work)
 {
 	struct pnv_php_event *event =
 		container_of(work, struct pnv_php_event, work);
 	struct pnv_php_slot *php_slot = event->php_slot;
 
-	if (event->added)
+	if (event->added) {
 		pnv_php_enable_slot(&php_slot->slot);
-	else
+	} else {
 		pnv_php_disable_slot(&php_slot->slot);
+		pnv_php_detect_clear_suprise_removal_freeze(php_slot);
+	}
 
 	kfree(event);
 }
-- 
2.39.5
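
Illustration only, not part of the patch: the two bounded-retry paths above
can be modelled in plain userspace C to make the control flow easier to
follow. Everything prefixed sim_ is a hypothetical stand-in invented for this
sketch (sim_power_on() for pnv_php_set_slot_power_state(), sim_reset() for
the warm reset assert/deassert pair, sim_unfreeze_pe() for eeh_unfreeze_pe());
no kernel, OPAL or EEH interface is called, and the 250 ms delay is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_MAX_TRIES 3	/* mirrors the literal 3 used in the patch */

/* Hypothetical model of a slot behind a possibly fenced PHB. */
struct sim_slot {
	int resets_needed;	/* resets required before power-on works */
	int resets_done;	/* resets issued so far */
	int power_on_calls;	/* number of power-on attempts */
};

/* Hypothetical model of the upstream bridge PE. */
struct sim_pe {
	bool isolated;		/* models the EEH_PE_ISOLATED flag */
	int thaw_failures_left;	/* thaw attempts that will still fail */
};

/* Stand-in for the power-on request: 0 on success, -1 while fenced. */
static int sim_power_on(struct sim_slot *s)
{
	s->power_on_calls++;
	return (s->resets_done >= s->resets_needed) ? 0 : -1;
}

/* Stand-in for asserting and then deasserting the warm reset. */
static void sim_reset(struct sim_slot *s)
{
	s->resets_done++;
}

/* Stand-in for the thaw call: 0 on success, -1 on failure. */
static int sim_unfreeze_pe(struct sim_pe *pe)
{
	if (pe->thaw_failures_left > 0) {
		pe->thaw_failures_left--;
		return -1;
	}
	pe->isolated = false;
	return 0;
}

/* One plain attempt, then up to SIM_MAX_TRIES reset-and-retry rounds. */
int sim_activate_slot(struct sim_slot *s)
{
	int ret, i;

	assert(s != NULL);
	ret = sim_power_on(s);
	if (!ret)
		return 0;

	for (i = 0; i < SIM_MAX_TRIES; i++) {
		sim_reset(s);
		ret = sim_power_on(s);
		if (!ret)
			break;
	}
	return ret;
}

/* After removal: thaw the PE if it is frozen, giving up after 3 tries. */
int sim_clear_removal_freeze(struct sim_pe *pe)
{
	int i;

	assert(pe != NULL);
	if (!pe->isolated)
		return 0;

	for (i = 0; i < SIM_MAX_TRIES; i++)
		if (!sim_unfreeze_pe(pe))
			return 0;
	return -1;
}
```

In this model a slot that needs two resets comes up on the third power-on
attempt, while one that needs four is abandoned after three resets with the
last error returned, matching the "up to three times before aborting" comment
in pnv_php_activate_slot().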