From nobody Tue Jun 30 05:25:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B3112C433EF for ; Mon, 24 Jan 2022 18:18:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245210AbiAXSSB (ORCPT ); Mon, 24 Jan 2022 13:18:01 -0500 Received: from mail-bn8nam12on2083.outbound.protection.outlook.com ([40.107.237.83]:44865 "EHLO NAM12-BN8-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S245172AbiAXSRx (ORCPT ); Mon, 24 Jan 2022 13:17:53 -0500 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=gostEVAH2DM7xJv2wzL0XjZlXC2JOqDPhBKwekWqj6W422uQps79q7QctxX6M7e7NtxwQzVIUO78M2j631uVRLHqaKiF4to/fuVD87dy28J5BoMYgvU18kZP1t2EjRoamlRmLWaXEfHSQ4N4mkT8i347Jl/LH0StNij2WRbp7Fxhq6UjZEtru3qyEJRDIjHHS4Z4RZggfKjcJOQnXTevdnGGF8IGb5HMI3kvNKKTIBC7JgnCq7eNfPo1w4KO4Z/cKM1akQ/bij1cJiaa7C7xoR5CdOnMqSu5m2udSWp1XOYd45aP0eJbXHV6+uWKX5P5E3i5xg9b2V4BZ2Eyy2mvmw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=zkVdGMLKC2SYT8bHfypRNDWu6RqXsLY86Q9yCVu6iK4=; b=RWlJ+qZvyRBYNUaU5pmIWDSNW3/FQZl6g2dwcFaX4MjjC3XD2chWj+kHPUeEOhxsTaRg52hMqSm7jCnKMrAY/qlkhya36PrkpVl63Iv1ez+AhK1V4g9DCtwgFUNHIcY9XqyW5++aSLsKUlMc/svOB4krPsA86VePjtpObs6QcQDW8F1AOOx2Dfju0S+ryq50ufvTCJh9gTrlqeCbBgUEXm3k/BxyDqAaoMqH5Lp2U/omV8SUc55vdkQPHxuGOaRK2aZkSKWjR6MvejS3yS+HWghSTH4oVS/YZkna4DZYkzhS3aLmJuwZkHb0VZnRDKxOvaEMgXl0iZKKjjiN8OFQgg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 12.22.5.236) smtp.rcpttodomain=huawei.com smtp.mailfrom=nvidia.com; dmarc=pass (p=reject sp=reject pct=100) 
action=none header.from=nvidia.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=zkVdGMLKC2SYT8bHfypRNDWu6RqXsLY86Q9yCVu6iK4=; b=aZJ02EAay1lmQaFtng85QhrAIa8YlO3PjkwZH8QYq32ZcHqYcUNh8CgeSCeYYlTnLiAxLfs10i8bbN7Wxixz2cw9yHSJ5zYFVNDIp3/zZa15HLTeN/CZv5w0u8yYv3PHCqvjVvwkqfaYuQ1ZCZkqUsH+8D5sLEu7JBBUHLDAVyh8V4OXdN3mbh8Z+YnHkSVTgO9Qsy9J150gWVTueeNXWBtrNaNwVJ6z9fpLaYGAE6MizY1k8qzDLFXhF8zfl1ULEy1q68fjT47gPWIH3+M1fXu0/mYbIqgjx4jKSwjPqrQb+E64gdH9ires2Eym2waQCwHbY9ThX5gRA/p/eCy2KA== Received: from DM5PR06CA0078.namprd06.prod.outlook.com (2603:10b6:3:4::16) by DM5PR12MB1913.namprd12.prod.outlook.com (2603:10b6:3:10d::17) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.7; Mon, 24 Jan 2022 18:17:42 +0000 Received: from DM6NAM11FT060.eop-nam11.prod.protection.outlook.com (2603:10b6:3:4:cafe::31) by DM5PR06CA0078.outlook.office365.com (2603:10b6:3:4::16) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:42 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 12.22.5.236) smtp.mailfrom=nvidia.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=nvidia.com; Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.236 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.236; helo=mail.nvidia.com; Received: from mail.nvidia.com (12.22.5.236) by DM6NAM11FT060.mail.protection.outlook.com (10.13.173.63) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:42 +0000 Received: from drhqmail202.nvidia.com (10.126.190.181) by DRHQMAIL109.nvidia.com (10.27.9.19) with Microsoft SMTP Server 
(TLS) id 15.0.1497.18; Mon, 24 Jan 2022 18:17:41 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by drhqmail202.nvidia.com (10.126.190.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.9; Mon, 24 Jan 2022 10:17:41 -0800 Received: from nvidia-Inspiron-15-7510.nvidia.com (10.127.8.13) by mail.nvidia.com (10.126.190.180) with Microsoft SMTP Server id 15.2.986.9 via Frontend Transport; Mon, 24 Jan 2022 10:17:37 -0800 From: Abhishek Sahu To: , Alex Williamson , Cornelia Huck CC: Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , , Abhishek Sahu Subject: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Date: Mon, 24 Jan 2022 23:47:22 +0530 Message-ID: <20220124181726.19174-2-abhsahu@nvidia.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220124181726.19174-1-abhsahu@nvidia.com> References: <20220124181726.19174-1-abhsahu@nvidia.com> X-NVConfidentiality: public MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 2f7c9dc4-5f45-4318-532c-08d9df65d013 X-MS-TrafficTypeDiagnostic: DM5PR12MB1913:EE_ X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:10000; X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: 
On2/JBp3l54X9QuKDHmHZXPt+UL6AI1iYy7XThbn0PLabdHkhl4gPD5egS7tRZZN2zRwkhRrlHk/Hh871/geS/30BdW2lKBW7eajfkELOfSJL6qDwsG2lodQfN5ywAlGlPd82cc+MLTirpd+6rMsagxKzstlTqdxGBkgl+qb8b1SbnPTva6Ve5rIl87QgRzvMzUmtz50lCN0tM1MzjDUtheFSUJOhPh+z0jxdBIm6CjSD3fC+VjkQ195hW1i/q0ZV29qFl9p7VkL0n2cC15CyOgx/veXMEl4e7nwdHEjDcWz5kUueWy7jlEYvSy+d4HqVsbYRNlxf8TqfpOkSksmHlHQNvgUFjXuh+a62uGwMloJQl+hgjU0EGkIWJ9QztrnYjrya4U6a1KiIHEZQpfu2EIHbBll0tXyUenONUtJQ1J26Ws3epaPMqSHgA/5aDjKvCVe8kQgSBAnu0E8HgJEd0DEy0gRDbTUfmy82DLAdZp0xNFxnhL2FgSsezBGEWKvZ6KLiBfUVmSQJkV27H26cmDAyHDonapmqf6SR8g+v/U8syIMc9l6Ugp9HJHZk8F7+ORmQMieUvEM1z7fGHCco7pbE+kK1zytg+IEfHzrsocSFx7cKQG/qK8pdaOftcRxflFwEZ0XGjSRGUR1u6VfjGBX9GbjpyYjiYwner2m0mVbN4/iWoguEw3g/mNk/CIMezOLJqUzWuhb/wUrScNHhdabRR+P/1iulU0L+xSS8g3fHxN8kPoiz5P5EXl78PnN4Tn3FU6WHyqXdwygAblPgEanuj4fS9pzTX1cdyb7SCg= X-Forefront-Antispam-Report: CIP:12.22.5.236;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:mail.nvidia.com;PTR:InfoNoRecords;CAT:NONE;SFS:(4636009)(46966006)(36840700001)(40470700004)(2616005)(6666004)(1076003)(5660300002)(110136005)(186003)(356005)(426003)(508600001)(36756003)(54906003)(8676002)(47076005)(107886003)(336012)(316002)(40460700003)(82310400004)(4326008)(36860700001)(81166007)(86362001)(2906002)(70586007)(7696005)(83380400001)(70206006)(26005)(8936002)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2022 18:17:42.3223 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 2f7c9dc4-5f45-4318-532c-08d9df65d013 X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[12.22.5.236];Helo=[mail.nvidia.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT060.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM5PR12MB1913 Precedence: bulk 
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

Currently, the upstream vfio-pci driver has very limited power
management support. If a vfio-pci device has no user, the PCI device is
moved into the D3hot state by writing directly to the PCI PM registers.
The D3hot state helps save power, but we can achieve zero power
consumption by going into the D3cold state. D3cold cannot be reached
with native PCI PM alone. It requires interaction with platform
firmware, which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used. It internally
interacts with PCI and platform firmware and puts the device into the
lowest possible D-state.

This patch registers the vfio-pci driver with the runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM related
   things. To enable runtime PM, the PCI driver needs to decrement the
   usage count and register the runtime suspend/resume callbacks. For a
   vfio-pci based driver, these callback routines can be stubbed in
   this patch since the vfio-pci driver does not do the PCI device
   initialization. All the config state saving and PCI power management
   related things are done by the PCI core framework itself inside its
   runtime suspend/resume callbacks.

2. Inside pci_reset_bus(), all the devices in the bus/slot are moved
   into the D0 state. This state change to D0 can happen directly,
   without going through the runtime PM framework. So if runtime PM is
   enabled, pm_runtime_resume() makes the runtime state active. Since
   the PCI device power state is already D0, it should return early
   when it tries to change the state with pci_set_power_state(). Then
   pm_request_idle() can be used, which internally checks the device
   usage count and moves the device back into the low power state.

3. Inside vfio_pci_core_disable(), the device usage count that was
   incremented in vfio_pci_core_enable() always needs to be
   decremented.

4. Since the runtime PM framework provides the same functionality,
   directly writing to the PCI PM config register can be replaced with
   the runtime PM routines. Also, the use of runtime PM can save more
   power.

   In systems which do not support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3hot

   So, with runtime PM, the upstream bridge or root port also goes into
   a lower power state, which is not possible with the existing
   implementation.

   In systems which support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3cold
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3cold

   So, with runtime PM, both the PCI device and the upstream bridge go
   into the D3cold state.

5. If the 'disable_idle_d3' module parameter is set, runtime PM is
   still enabled, but in this case the usage count should not be
   decremented.

6. The vfio_pci_dev_set_try_reset() return value is now unused, so the
   return type of this function can be changed to void.

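The usage-count handling in points 1, 3 and 5 above can be illustrated
with a small userspace model. This is only a sketch under stated
assumptions, not kernel code: struct model_dev and all model_* helpers
are hypothetical names that stand in for the runtime PM usage counter
and for calls such as pm_runtime_resume_and_get(), pm_runtime_put() and
pm_runtime_get_noresume(). It assumes the PCI core hands the device to
the driver with one usage reference already held, and it suspends
immediately when the count drops to zero, whereas the real
pm_runtime_put() only queues an idle request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a device's runtime PM bookkeeping. */
struct model_dev {
    int usage_count;   /* models the runtime PM usage counter */
    bool suspended;    /* true when the model treats the device as low power */
};

/* Models a "get": take a reference and make sure the device is active. */
void model_get(struct model_dev *d)
{
    d->usage_count++;
    d->suspended = false;
}

/* Models a "put": drop a reference; at zero the device may suspend. */
void model_put(struct model_dev *d)
{
    assert(d->usage_count > 0);
    if (--d->usage_count == 0)
        d->suspended = true;
}

/* Register: the core is assumed to hand over the device with one reference. */
void model_register(struct model_dev *d, bool disable_idle_d3)
{
    d->usage_count = 1;
    d->suspended = false;
    if (!disable_idle_d3)
        model_put(d);          /* no user yet, so allow low power */
}

/* Enable: a user opens the device, so it must be active. */
void model_enable(struct model_dev *d, bool disable_idle_d3)
{
    if (!disable_idle_d3)
        model_get(d);
}

/* Disable: drop the reference taken in model_enable(). */
void model_disable(struct model_dev *d, bool disable_idle_d3)
{
    if (!disable_idle_d3)
        model_put(d);
}

/* Unregister: restore the initial reference without resuming the device. */
void model_unregister(struct model_dev *d, bool disable_idle_d3)
{
    if (!disable_idle_d3)
        d->usage_count++;
}
```

In this model every decrement is paired with an increment under the
same disable_idle_d3 condition, so the counter returns to its initial
value of 1 at unregister time in both configurations.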
Signed-off-by: Abhishek Sahu --- drivers/vfio/pci/vfio_pci.c | 3 + drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++--------- include/linux/vfio_pci_core.h | 4 ++ 3 files changed, 75 insertions(+), 27 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index a5ce92beb655..c8695baf3b54 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver =3D { .remove =3D vfio_pci_remove, .sriov_configure =3D vfio_pci_sriov_configure, .err_handler =3D &vfio_pci_core_err_handlers, +#if defined(CONFIG_PM) + .driver.pm =3D &vfio_pci_core_pm_ops, +#endif }; =20 static void __init vfio_pci_fill_ids(void) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_c= ore.c index f948e6cd2993..c6e4fe9088c3 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_d= evice *vdev) } =20 struct vfio_pci_group_info; -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, struct vfio_pci_group_info *groups); =20 @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *= vdev) u16 cmd; u8 msix_pos; =20 - vfio_pci_set_power_state(vdev, PCI_D0); + if (!disable_idle_d3) { + ret =3D pm_runtime_resume_and_get(&pdev->dev); + if (ret < 0) + return ret; + } =20 /* Don't allow our initial saved state to include busmaster */ pci_clear_master(pdev); @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device= *vdev) out: pci_disable_device(pdev); =20 - if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D3hot); + vfio_pci_dev_set_try_reset(vdev->vdev.dev_set); + + /* Put the pm-runtime usage counter acquired during 
enable */ + if (!disable_idle_d3) + pm_runtime_put(&pdev->dev); } EXPORT_SYMBOL_GPL(vfio_pci_core_disable); =20 @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_c= ore_device *vdev) =20 vfio_pci_probe_power_state(vdev); =20 - if (!disable_idle_d3) { - /* - * pci-core sets the device power state to an unknown value at - * bootup and after being removed from a driver. The only - * transition it allows from this unknown state is to D0, which - * typically happens when a driver calls pci_enable_device(). - * We're not ready to enable the device yet, but we do want to - * be able to get to D3. Therefore first do a D0 transition - * before going to D3. - */ - vfio_pci_set_power_state(vdev, PCI_D0); - vfio_pci_set_power_state(vdev, PCI_D3hot); - } + /* + * pci-core sets the device power state to an unknown value at + * bootup and after being removed from a driver. The only + * transition it allows from this unknown state is to D0, which + * typically happens when a driver calls pci_enable_device(). + * We're not ready to enable the device yet, but we do want to + * be able to get to D3. Therefore first do a D0 transition + * before enabling runtime PM. 
+ */ + vfio_pci_set_power_state(vdev, PCI_D0); + pm_runtime_allow(&pdev->dev); + + if (!disable_idle_d3) + pm_runtime_put(&pdev->dev); =20 ret =3D vfio_register_group_dev(&vdev->vdev); if (ret) @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_cor= e_device *vdev) =20 out_power: if (!disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D0); + pm_runtime_get_noresume(&pdev->dev); + + pm_runtime_forbid(&pdev->dev); out_vf: vfio_pci_vf_uninit(vdev); return ret; @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_= core_device *vdev) vfio_pci_vga_uninit(vdev); =20 if (!disable_idle_d3) - vfio_pci_set_power_state(vdev, PCI_D0); + pm_runtime_get_noresume(&pdev->dev); + + pm_runtime_forbid(&pdev->dev); } EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); =20 @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfi= o_device_set *dev_set) * - At least one of the affected devices is marked dirty via * needs_reset (such as by lack of FLR support) * Then attempt to perform that bus or slot reset. - * Returns true if the dev_set was reset. */ -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) { struct vfio_pci_core_device *cur; struct pci_dev *pdev; int ret; =20 if (!vfio_pci_dev_set_needs_reset(dev_set)) - return false; + return; =20 pdev =3D vfio_pci_dev_set_resettable(dev_set); if (!pdev) - return false; + return; =20 ret =3D pci_reset_bus(pdev); if (ret) - return false; + return; =20 list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { cur->needs_reset =3D false; - if (!disable_idle_d3) - vfio_pci_set_power_state(cur, PCI_D3hot); + if (!disable_idle_d3) { + /* + * Inside pci_reset_bus(), all the devices in bus/slot + * will be moved out of D0 state. This state change to + * D0 can happen directly without going through the + * runtime PM framework. 
pm_runtime_resume() will + * help make the runtime state as active and then + * pm_request_idle() can be used which will + * internally check for device usage count and will + * move the device again into the low power state. + */ + pm_runtime_resume(&pdev->dev); + pm_request_idle(&pdev->dev); + } } - return true; } =20 +#ifdef CONFIG_PM +static int vfio_pci_core_runtime_suspend(struct device *dev) +{ + return 0; +} + +static int vfio_pci_core_runtime_resume(struct device *dev) +{ + return 0; +} + +const struct dev_pm_ops vfio_pci_core_pm_ops =3D { + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, + vfio_pci_core_runtime_resume, + NULL) +}; +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops); +#endif + void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga, bool is_disable_idle_d3) { diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index ef9a44b6cf5d..aafe09c9fa64 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *= vdev); void vfio_pci_core_disable(struct vfio_pci_core_device *vdev); void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev); =20 +#ifdef CONFIG_PM +extern const struct dev_pm_ops vfio_pci_core_pm_ops; +#endif + static inline bool vfio_pci_is_vga(struct pci_dev *pdev) { return (pdev->class >> 8) =3D=3D PCI_CLASS_DISPLAY_VGA; --=20 2.17.1 From nobody Tue Jun 30 05:25:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EFE29C433F5 for ; Mon, 24 Jan 2022 18:17:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245016AbiAXSRu (ORCPT ); Mon, 24 Jan 2022 13:17:50 -0500 Received: from mail-mw2nam12on2065.outbound.protection.outlook.com ([40.107.244.65]:51048 "EHLO 
NAM12-MW2-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S245165AbiAXSRs (ORCPT ); Mon, 24 Jan 2022 13:17:48 -0500 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=d8VS6ESPbmx8/CA86J2Wr0o0wB1YMDLnljSpCcTitz1eaGqgMLpsxQEYfPQLoUJ+c4tYcqXfHuVeKBXkLKFX+VHVUhdseeqMpV0xPHVtS4yNye4HNhFsqszLAaqjkdBEaDhqHOQPE7GJbSCl0+d2J9rN43L35GQcuwVGqQER31DdtgJHolwNgKWr0cRpZDe2CA5rVlJxd6YrQaoW5+c2Yg4/nEdO32T90yfVvdexaE8KL6U6QAMN9M+jl8snd/sCS4ur0imm5GTM6+mvpBZtW7F42Q//Ca4kO14WQlpYNTkkyrHi/yAcKaie2qc5E72yWncRr1Ad+wXFW6jaU42PEA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=nVVC8wO3XipGN4e/nucddVGaDwTmNw9cj+G8smfywpA=; b=hwL4AgMUYxLBUV7AWt9VJHnlMD8kXEBB+CrGsWmb7Oo5T9QL6QTTjBOflXt4tPmSgMpqh3zCZhskYA/QlXHz5lZ4RTLSAPncG+P6vWygm218sZEElsKrogrGc4oMTdnFQDAZcNtmYyrq/eVTCJiG83mCWjzIQ11fSrLWhmXOkU3krtaM4dtwuZQtlnaVfhoD/j/FWD1oL2mzfL938IrGaM0SljwK3RdBeIsokZnhYkjfZxUvHARj22RZQlDCCTW5DwVkgbrf1FpLwSwijqWifhT+Rl3r9UPeFTYxjueEcOnLu9cJrZj/KMMEScXkJ5b5oj4JIQC4ASbRnJ7AEohK0w== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 12.22.5.238) smtp.rcpttodomain=huawei.com smtp.mailfrom=nvidia.com; dmarc=pass (p=reject sp=reject pct=100) action=none header.from=nvidia.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=nVVC8wO3XipGN4e/nucddVGaDwTmNw9cj+G8smfywpA=; 
b=m1zLQO6KqxXSKMzskDIYLRchmqNterAmW8jyvrKhhkuxVMM4XtbrEVL5R5VxDZGjYXYBIhhfkCv3ukTT0JbbMvEyeC3NOJqEUecDyXZfl91gOGcbtDk2UIAnl2oK7J5pbUWaHAyWhb7fbWLyJ9Oq5moLrcUQ1F+Q3PSadpwivmT2+eR8N5vMcS9BmI+F/KFUN6LnKJwpsgpEjuL3NTIgQ6j0xmuNDVKNNTpqRolWW3FRsv0k8z64nFnRX3dXvUeUJ31z5elTmF9x6jGu29uLuM19xrddEFqcYhPm2htbzVoHWV1qnJxBh2jlKPXxOB+/VRCDLHnVQjl1AE+KaA1SHw== Received: from DM5PR21CA0041.namprd21.prod.outlook.com (2603:10b6:3:ed::27) by DM5PR12MB1131.namprd12.prod.outlook.com (2603:10b6:3:73::10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.10; Mon, 24 Jan 2022 18:17:47 +0000 Received: from DM6NAM11FT049.eop-nam11.prod.protection.outlook.com (2603:10b6:3:ed:cafe::65) by DM5PR21CA0041.outlook.office365.com (2603:10b6:3:ed::27) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4951.3 via Frontend Transport; Mon, 24 Jan 2022 18:17:47 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 12.22.5.238) smtp.mailfrom=nvidia.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=nvidia.com; Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.238 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.238; helo=mail.nvidia.com; Received: from mail.nvidia.com (12.22.5.238) by DM6NAM11FT049.mail.protection.outlook.com (10.13.172.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:46 +0000 Received: from drhqmail203.nvidia.com (10.126.190.182) by DRHQMAIL105.nvidia.com (10.27.9.14) with Microsoft SMTP Server (TLS) id 15.0.1497.18; Mon, 24 Jan 2022 18:17:45 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by drhqmail203.nvidia.com (10.126.190.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.9; Mon, 24 Jan 2022 10:17:45 -0800 
Received: from nvidia-Inspiron-15-7510.nvidia.com (10.127.8.13) by mail.nvidia.com (10.126.190.180) with Microsoft SMTP Server id 15.2.986.9 via Frontend Transport; Mon, 24 Jan 2022 10:17:41 -0800 From: Abhishek Sahu To: , Alex Williamson , Cornelia Huck CC: Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , , Abhishek Sahu Subject: [RFC PATCH v2 2/5] vfio/pci: virtualize PME related registers bits and initialize to zero Date: Mon, 24 Jan 2022 23:47:23 +0530 Message-ID: <20220124181726.19174-3-abhsahu@nvidia.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220124181726.19174-1-abhsahu@nvidia.com> References: <20220124181726.19174-1-abhsahu@nvidia.com> X-NVConfidentiality: public MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 92b280e7-315d-49a0-eb60-08d9df65d270 X-MS-TrafficTypeDiagnostic: DM5PR12MB1131:EE_ X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:10000; X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: jOEu4acTUta+EpLEwJNVcuEH5Bhqo8eqAZahG+lCdTH4o3YfjDT4qMp/CXjK7lj61m/QeQDrhuM1NpKRr/JePziJKlNl9vsCDuNF6EDgrYYRu0O80hODZZjYd2Q30NYF7cUSznxMJu/5MxgxEQGTHK8SwayMSRJHmHX/n3QOyUoerb7gpaZhXHApEElhICwl6Yxjv8S/DqZ3R18v2KCTrmrT7jxdkONDXmCS3cbfg6OsrgUalGMp+ntUak28Dj0aOe2Sd9berfDqy2GpvWjsPgWjuSlci24pCz5BY4hPZ3WZ2DHBJLsHj3qHyC2viEOogE8Y+Onra+o8IkKWY61Fwm0b1K+oja0bGUcdICzi+Q/0g68T7+Blal/xi4NeZvZCzkY/9Wy8RVRQS3WRObelIaaWu5/DStIOZwtedPhNVwmtH22z0cK1Lk7uY4t7EbPHrSylHr2LpZ2jftd3jjRW52i4Rop4WCCpmjd7fQ8nyZl4PtQh0hV/fEzJkUKubgRAF+Yuz45TCqdwEjaz12+z4NM4KpmeIAeQDRBjxzoK3n5uPDWi662FL0fqTl3BGY4yIg+5cwGSAAlSSOC8QTL0I5Dl0m7S3LKUQXnWwylMBLsDfDtBO4DMdeDvHI9D9N6FODIaR03mST1VbhgiPZj5IrmfPJ6UplIRDQHWlQZRp09NB3t2PjDSVU8RSOrPWSU1IQ72zeV2e+N4J286DhLRjpM7LqonJn5lXW5Ops+n6cePJ69FF2cP54mY9LKH3L/ezanHB/lvkZC/v2tK2rMRxpDiJ8PCowNyM5gUslusBtI= X-Forefront-Antispam-Report: 
CIP:12.22.5.238;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:mail.nvidia.com;PTR:InfoNoRecords;CAT:NONE;SFS:(4636009)(46966006)(40470700004)(36840700001)(47076005)(54906003)(1076003)(36756003)(356005)(83380400001)(336012)(40460700003)(82310400004)(316002)(107886003)(6666004)(5660300002)(110136005)(70586007)(70206006)(4326008)(81166007)(36860700001)(426003)(508600001)(86362001)(8936002)(2906002)(2616005)(26005)(7696005)(186003)(8676002)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2022 18:17:46.2755 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 92b280e7-315d-49a0-eb60-08d9df65d270 X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[12.22.5.238];Helo=[mail.nvidia.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT049.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: DM5PR12MB1131 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

If a PCI device generates a PME event, it is mostly handled on the host
by the root port PME code. For example, in the case of PCIe, the PME
event is sent to the root port, which then generates the PME interrupt.
This is handled on the host side in drivers/pci/pcie/pme.c, which calls
pci_check_pme_status(), where the PME_Status and PME_En bits are
cleared. So the guest OS that is using the vfio-pci device never learns
about this PME event.

To handle these PME events inside guests, we need some framework so
that any PME event that happens is forwarded to the virtual machine
monitor.
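As a rough illustration of the host-side handling described above, the
following userspace sketch models how acknowledging a PME on the host
leaves nothing for the guest to observe. It is an approximation under
stated assumptions, not the code in drivers/pci/pcie/pme.c: struct
model_pmcsr and the model_* functions are hypothetical names, and only
the two bit positions are taken from the PCI PM capability
(PCI_PM_CTRL_PME_ENABLE and PCI_PM_CTRL_PME_STATUS). PME_Status is
modelled as a write-1-to-clear bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PME_ENABLE 0x0100u  /* same bit as PCI_PM_CTRL_PME_ENABLE */
#define MODEL_PME_STATUS 0x8000u  /* same bit as PCI_PM_CTRL_PME_STATUS */

/* Hypothetical model of the PMCSR register. */
struct model_pmcsr {
    uint16_t val;
};

/* Register write: PME_Status is write-1-to-clear, other bits are stored. */
void model_pmcsr_write(struct model_pmcsr *r, uint16_t v)
{
    uint16_t status = r->val & MODEL_PME_STATUS;

    if (v & MODEL_PME_STATUS)
        status = 0;                /* writing 1 clears the pending status */
    r->val = (uint16_t)((v & ~MODEL_PME_STATUS) | status);
}

/* Host-side check: returns true if a PME was pending and acknowledges it. */
bool model_host_check_pme(struct model_pmcsr *r)
{
    uint16_t v = r->val;

    if (!(v & MODEL_PME_STATUS))
        return false;

    /* Write the status bit back as 1 to clear it, and drop PME_En. */
    v &= (uint16_t)~MODEL_PME_ENABLE;
    model_pmcsr_write(r, v);
    return true;
}
```

Once model_host_check_pme() has run, a later read of the modelled
register shows neither PME_Status nor PME_En, which is the situation a
guest reading the real register would be in.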
For now, we can virtualize the PME related register bits and initialize
these bits to zero, so that the vfio-pci device user assumes the device
is not capable of asserting the PME# signal from any power state.

Signed-off-by: Abhishek Sahu --- drivers/vfio/pci/vfio_pci_config.c | 33 +++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci= _config.c index 6e58b4bf7a60..dd9ed211ba6f 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -738,12 +738,29 @@ static int __init init_pci_cap_pm_perm(struct perm_bi= ts *perm) */ p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE); =20 + /* + * The guests can't process PME events. If any PME event will be + * generated, then it will be mostly handled in the host and the + * host will clear the PME_STATUS. So virtualize PME_Support bits. + * The vconfig bits will be cleared during device capability + * initialization. + */ + p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE); + /* * Power management is defined *per function*, so we can let * the user change power state, but we trap and initiate the * change ourselves, so the state bits are read-only. + * + * The guest can't process PME from D3cold so virtualize PME_Status + * and PME_En bits. The vconfig bits will be cleared during device + * capability initialization. 
*/ - p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK); + p_setd(perm, PCI_PM_CTRL, + PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS, + ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS | + PCI_PM_CTRL_STATE_MASK)); + return 0; } =20 @@ -1412,6 +1429,17 @@ static int vfio_ext_cap_len(struct vfio_pci_core_dev= ice *vdev, u16 ecap, u16 epo return 0; } =20 +static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev, + int offset) +{ + __le16 *pmc =3D (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC]; + __le16 *ctrl =3D (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL]; + + /* Clear vconfig PME_Support, PME_Status, and PME_En bits */ + *pmc &=3D ~cpu_to_le16(PCI_PM_CAP_PME_MASK); + *ctrl &=3D ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS); +} + static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev, int offset, int size) { @@ -1535,6 +1563,9 @@ static int vfio_cap_init(struct vfio_pci_core_device = *vdev) if (ret) return ret; =20 + if (cap =3D=3D PCI_CAP_ID_PM) + vfio_update_pm_vconfig_bytes(vdev, pos); + prev =3D &vdev->vconfig[pos + PCI_CAP_LIST_NEXT]; pos =3D next; caps++; --=20 2.17.1 From nobody Tue Jun 30 05:25:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD147C433F5 for ; Mon, 24 Jan 2022 18:17:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245186AbiAXSR4 (ORCPT ); Mon, 24 Jan 2022 13:17:56 -0500 Received: from mail-co1nam11on2055.outbound.protection.outlook.com ([40.107.220.55]:26401 "EHLO NAM11-CO1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S235979AbiAXSRx (ORCPT ); Mon, 24 Jan 2022 13:17:53 -0500 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; 
b=M8ti58uCR6gfMCUvdVBq9ucN/fZ1bJocRZe0Bou+m15ukMUlInTp1lJTndyfPeCeKBSRy7zphkYnDYMQ9rLtfpNoPua9qGfEg8Po21agOyDFOmapdN52i5veDCyQIF7h2iUaMXf8Xk62BpKcS5fzTQt3qGJ3G+2GbiTc62128YMtTHZRj9MPpUtWuQTR+zpib7euKzL/fogPg8eX1HH/8gLRB4gKk2iuszy6L+Aqu/bLXSPtOkiArcmhx/m000a1m/FgFYg82pt7x3xH8D1hB3ykvyZ0wb+C1ECQhrCDQNRDn+ZbY8tsHMxt51zcbkxkk4R+oHeXzCL40qIfrTTLVw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=AFvtPRPnEp+41Zbi3/aZszS0oli6S3TWuL1J9gO87r4=; b=O1iuHhaXR4Uh+RRhwitNi7eq9a+nR/rBCWc30jISWjrTQkfGfoySEfdGF3jEKVaV2AhruDenhnC5/7gpeLEIOzOWyMke0DQBMn0dSNADyw4N/XhP/IQ8Jz6r/yXVqc9YjL/B7rv8Bteb9lDD8M6SI/gn2pVesiXndTRwBC5iClszsd/urLgT1rPm87+UQEK1zBrhWP1iKxnS6rYU1QCLH8kYipnhgl/vZUQBtspsyMtcw4ufgd8yrPqWBdMSj5DS42VXueYuw5owqpcKqmg5QbpGVm+LGvoHvfFaWskqX1Poc2LTmDtEA2mbb+XtNqIXOBhGFjtnZFrD1eMOs6Xc6A== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 12.22.5.235) smtp.rcpttodomain=huawei.com smtp.mailfrom=nvidia.com; dmarc=pass (p=reject sp=reject pct=100) action=none header.from=nvidia.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=AFvtPRPnEp+41Zbi3/aZszS0oli6S3TWuL1J9gO87r4=; b=HYjN56F4pVDUaKRRvPRCYhkyT1x6zTHV9O6RfRD4NpPtCaTfICimh5+IbzrNb48uUBv/tGDBFBf1D5s+HgfagijDZEO8L4NcRdfXPb9NJDW827Fev3UA47Y9CstSKNTX3ykkXk95PS3jAasjRGlv7lsLpogb6p3oWzDKH7oQEdacqUHX1zVo2pnIRv0e6QxF7NnH3dYyQFb4oPs1yKlP1UYc3i4mFeQRalPf9nYqFdztA9HLWzVJ6xe5BUlDm+MzekQbGfRNLO11c3K+uJCTJjHDpAzZupLL7L0M7peksDA0ImYzU5SDXRi3S6dP6C1md6Q1Vvo2EkBvzLbdNQpOqw== Received: from DS7PR03CA0104.namprd03.prod.outlook.com (2603:10b6:5:3b7::19) by BN7PR12MB2833.namprd12.prod.outlook.com (2603:10b6:408:27::14) with Microsoft 
SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.8; Mon, 24 Jan 2022 18:17:51 +0000 Received: from DM6NAM11FT046.eop-nam11.prod.protection.outlook.com (2603:10b6:5:3b7:cafe::c3) by DS7PR03CA0104.outlook.office365.com (2603:10b6:5:3b7::19) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.17 via Frontend Transport; Mon, 24 Jan 2022 18:17:51 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 12.22.5.235) smtp.mailfrom=nvidia.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=nvidia.com; Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.235 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.235; helo=mail.nvidia.com; Received: from mail.nvidia.com (12.22.5.235) by DM6NAM11FT046.mail.protection.outlook.com (10.13.172.121) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:50 +0000 Received: from drhqmail202.nvidia.com (10.126.190.181) by DRHQMAIL107.nvidia.com (10.27.9.16) with Microsoft SMTP Server (TLS) id 15.0.1497.18; Mon, 24 Jan 2022 18:17:50 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by drhqmail202.nvidia.com (10.126.190.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.9; Mon, 24 Jan 2022 10:17:49 -0800 Received: from nvidia-Inspiron-15-7510.nvidia.com (10.127.8.13) by mail.nvidia.com (10.126.190.180) with Microsoft SMTP Server id 15.2.986.9 via Frontend Transport; Mon, 24 Jan 2022 10:17:45 -0800 From: Abhishek Sahu To: , Alex Williamson , Cornelia Huck CC: Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , , Abhishek Sahu Subject: [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition Date: Mon, 24 Jan 2022 23:47:24 +0530 Message-ID: 
<20220124181726.19174-4-abhsahu@nvidia.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220124181726.19174-1-abhsahu@nvidia.com> References: <20220124181726.19174-1-abhsahu@nvidia.com> X-NVConfidentiality: public MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 353910cb-8374-471f-9426-08d9df65d526 X-MS-TrafficTypeDiagnostic: BN7PR12MB2833:EE_ X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:9508; X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: E8gq5TRHkM8cyov4z3Iij87hxvYvQEoYhhCuk2vBgSdmQxuydcY3mTLde9Yw7Qq4Ka91+phIj6uq161EuP+zNDdU7mKXTiOk9vM21IWHFC1GS5yoVeXu9BnlhgXbmAZmamZVr2U2xvHdDIy5fVWnyu9JYeKl++O1S88bOboXFvJbXOxsOiJkkyI49/MlWyCwUEFDyM0oAYYsgDdNZH1skmaz1cpE/VKB2N5AAQeoaM09RrE8Yhpa1b0nzPto/dof9cb9VYfBhOGe2j1LmBC9Cwqa9rG3n9WsMRyTLrpdECdXX/jv1qbLziD8ppfNNkW8NDtBCWUookoIir6dywHQXCtJoUF1T8vraBiYKYl0yor8vt2s6uUBIOPvOUpvQNGQb27gH3ULF8GS4pDiC2IfetXY49DpEidoX4YP8/ri4DVNJ+Y/duAeTWzPYtzKR3w6MMCa/+3OXG5KeOu8bZ/Qq2iI0Um5f1o92+ddeNLmKbjKytLUwRgTc96ovcYHuUSq+13xDFFzghGLJtgZasGE1Ec/qPSwlrXn1xwUkp/h0DaYC6odtYIiBmx3Ex5T8Vbpj9PMdVyHbobeQHqIJYLrfemvU3eSuDjiTuGlpFlA8dDazcvkoA3PDrx1R+kPWerJZrJGrFod6Fi8OyJqXZuQTPlcH2bG8Uu4N0Yh1SNYT/Ph7TLnbR0V7YorE+5i6AyliB2UFUMk4LWQrg3QSfXxVeNRHkXnloQTxZiJKShmG91gbjYz+g6yOoSeF4BFy5wE4Au05SQris4ckbpltcMPcV4qkCNhX8PELQrTOzBMVM62vOMaNLHyJm3suUQMxxynYUP13Fz2sHIfZ7I5zpUDcA== X-Forefront-Antispam-Report: CIP:12.22.5.235;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:mail.nvidia.com;PTR:InfoNoRecords;CAT:NONE;SFS:(4636009)(46966006)(36840700001)(40470700004)(36860700001)(4326008)(1076003)(70206006)(6666004)(316002)(110136005)(54906003)(107886003)(47076005)(36756003)(8676002)(81166007)(2906002)(26005)(70586007)(336012)(82310400004)(83380400001)(2616005)(356005)(426003)(86362001)(8936002)(40460700003)(508600001)(186003)(7696005)(5660300002)(32563001)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: 
Nvidia.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2022 18:17:50.9152 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 353910cb-8374-471f-9426-08d9df65d526 X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[12.22.5.235];Helo=[mail.nvidia.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT046.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: BN7PR12MB2833 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

If needs_pm_restore is set (the PCI device does not support no soft reset), the current PCI state is saved during the D0->D3hot transition and restored during the D3hot->D0 transition. The state is saved locally with pci_store_saved_state(), and pci_load_and_free_saved_state() frees the allocated memory.

But for the reset-related IOCTLs, the vfio driver calls PCI reset APIs which internally change the PCI power state back to D0. So, when the guest resumes, it sees the current state as D0 and skips the call to vfio_pci_set_power_state() that would change the power state to D0 explicitly. In this case, the memory pointed to by pm_save is never freed. Also, in a malicious sequence, a transition to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in a loop and cause an OOM situation.

This patch stores the power state locally and uses it when comparing against the current power state. In the places where a D0 transition can happen, call vfio_pci_set_power_state() to transition to the D0 state. Since the vfio power state is still D3hot there, this D0 transition runs the logic required for the D3hot->D0 transition. Also, in case future development misses one of these paths, this patch adds a check which prints a warning and frees the earlier memory before pm_save is overwritten.

This locally saved power state will also help in subsequent patches.

Signed-off-by: Abhishek Sahu
---
 drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h | 1 +
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_c= ore.c index c6e4fe9088c3..ee2fb8af57fa 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci= _core_device *vdev) * restore when returned to D0. Saved separately from pci_saved_state for= use * by PM capability emulation and separately from pci_dev internal saved s= tate * to avoid it being overwritten and consumed around other resets. + * + * There are few cases where the PCI power state can be changed to D0 + * without the involvement of this API. So, cache the power state locally + * and call this API to update the D0 state. It will help in running the + * logic that is needed for transitioning to the D0 state. For example, + * if needs_pm_restore is set, then the PCI state will be saved locally. + * The memory taken for saving this PCI state needs to be freed to + * prevent memory leak.
*/ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_= t state) { @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_dev= ice *vdev, pci_power_t stat int ret; =20 if (vdev->needs_pm_restore) { - if (pdev->current_state < PCI_D3hot && state >=3D PCI_D3hot) { + if (vdev->power_state < PCI_D3hot && state >=3D PCI_D3hot) { pci_save_state(pdev); needs_save =3D true; } =20 - if (pdev->current_state >=3D PCI_D3hot && state <=3D PCI_D0) + if (vdev->power_state >=3D PCI_D3hot && state <=3D PCI_D0) needs_restore =3D true; } =20 ret =3D pci_set_power_state(pdev, state); =20 if (!ret) { + vdev->power_state =3D pdev->current_state; + /* D3 might be unsupported via quirk, skip unless in D3 */ - if (needs_save && pdev->current_state >=3D PCI_D3hot) { + if (needs_save && vdev->power_state >=3D PCI_D3hot) { + /* + * If somehow, the vfio driver was not able to free the + * memory allocated in pm_save, then free the earlier + * memory first before overwriting pm_save to prevent + * memory leak. + */ + if (vdev->pm_save) { + pci_warn(pdev, + "Overwriting saved PCI state pointer so freeing the earlier memory\n= "); + kfree(vdev->pm_save); + } + vdev->pm_save =3D pci_store_saved_state(pdev); } else if (needs_restore) { pci_load_and_free_saved_state(pdev, &vdev->pm_save); @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device= *vdev) /* For needs_reset */ lockdep_assert_held(&vdev->vdev.dev_set->lock); =20 + /* + * If disable has been called while the power state is other than D0, + * then set the power state in vfio driver to D0. It will help + * in running the logic needed for D0 power state. The subsequent + * runtime PM API's will put the device into the low power state again. 
+ */ + vfio_pci_set_power_state(vdev, PCI_D0); + /* Stop the device from further DMA */ pci_clear_master(pdev); =20 @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev= , unsigned int cmd, =20 vfio_pci_zap_and_down_write_memory_lock(vdev); ret =3D pci_try_reset_function(vdev->pdev); + + /* + * If pci_try_reset_function() has been called while the power + * state is other than D0, then pci_try_reset_function() will + * internally set the device state to D0 without vfio driver + * interaction. Update the power state in vfio driver to perform + * the logic needed for D0 power state. + */ + vfio_pci_set_power_state(vdev, PCI_D0); up_write(&vdev->memory_lock); =20 return ret; @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_de= vice_set *dev_set, =20 err_undo: list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { + /* + * If pci_reset_bus() has been called while the power + * state is other than D0, then pci_reset_bus() will + * internally set the device state to D0 without vfio driver + * interaction. Update the power state in vfio driver to perform + * the logic needed for D0 power state. 
+ */ + vfio_pci_set_power_state(cur, PCI_D0); if (cur =3D=3D cur_mem) is_mem =3D false; if (cur =3D=3D cur_vma) diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index aafe09c9fa64..05db838e72cc 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -124,6 +124,7 @@ struct vfio_pci_core_device { bool needs_reset; bool nointx; bool needs_pm_restore; + pci_power_t power_state; struct pci_saved_state *pci_saved_state; struct pci_saved_state *pm_save; int ioeventfds_nr; --=20 2.17.1 From nobody Tue Jun 30 05:25:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 034D4C433EF for ; Mon, 24 Jan 2022 18:18:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245269AbiAXSSK (ORCPT ); Mon, 24 Jan 2022 13:18:10 -0500 Received: from mail-dm6nam11on2084.outbound.protection.outlook.com ([40.107.223.84]:3809 "EHLO NAM11-DM6-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S245189AbiAXSR5 (ORCPT ); Mon, 24 Jan 2022 13:17:57 -0500 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=ZbL28qvhxzhNCaRTt6uyY+j8hxieFcEM2Rqy+jjPZFTjCu3/VR4fkVSHbbVneFcvQri+iCklSWdyLWBFo9yH9RMloR+J+qIwloYnCBf9Cxxqtpb819ZUe2G4jMFEwTtGJOsfuK0SgOeVGqQ24LbW5hBgUmlodIdhhS7jHeh71BpZeSHwMgPSFRP3eJN6iEneSlt7aMgEe5LA+cZRRqz3W2tabHaOdAK1NzylXMJqZz/1+CkSLI/C6KTjjXlFmvBk1mb6vFpbkMJKqTBuEVyY+xZMKfIck7e0sy1L7O4l6clZxOP4aM3BDtT1j7QzT8iPT0mUwNHxlT+pODif9lehIw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=rth0ZEyC9MBUtXH7RdAXZkShOSCjkgmWq8Wps9loBDY=; 
b=nYTbfjP7n5F0dHq/3dDLHYR7HQPoAAAMmPgQkHthoJ56hsfu9I0tkhdaQVAV//5sZ/uj1ZEjdFSgNkMWerJiSApHyeLztqWQwicc+584ChK9YmM+U5lXinWGumgchvT0tM6vB3NaF0wPueFIadF9rh0G1Cz0zEBbN6Bs3IPpfTt+F3nhJ2xphDnGf3kmmx5QCHpKXVleas3EI5/byfRUc+zFFTTwflnoAt8x812QpA9isdJUbVk+W1ZYSlWlnmV+zbamiYn+tzz29XPfgkdHi0mlvnI/aq7whuh1iKoXXbZog6mteCrx9qlo9rh+Mj8u9H4YKP0dfKVn4GtoCJJ/uQ== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 12.22.5.238) smtp.rcpttodomain=huawei.com smtp.mailfrom=nvidia.com; dmarc=pass (p=reject sp=reject pct=100) action=none header.from=nvidia.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=rth0ZEyC9MBUtXH7RdAXZkShOSCjkgmWq8Wps9loBDY=; b=LHnMwVnLJXdHkKemrPUJXqXexcJXR85edcSYNo9l9LjMxtAjlIhUY1X2MPPURbZohpSD4Hvl5kCGTRK2jRgqaw/VcjJyT6BOdZVzzHWUrQZcBhVMDsS1GbQ5GYlRyFKcSYTcj0YEj3f3ZaicU6JOLUgUNnJ0qjJCqayc5wlhN1S3IX5FgJ0mrmutHDR8lm61lvevpVrwnyr+DBiuxxGmIZp4xO/xXlKsyvhrnw98uXM/5fyJa2cJgIE/fyp7zVVNZR9pE+w+Yf+xIB56BQi4qbWIKekE1CGPTZyCPLWkT9FIDymKdF5MoVvFmqg9ijj40bV9A1OCft62IvWQG++u9w== Received: from DM6PR08CA0027.namprd08.prod.outlook.com (2603:10b6:5:80::40) by SA0PR12MB4381.namprd12.prod.outlook.com (2603:10b6:806:70::14) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.7; Mon, 24 Jan 2022 18:17:55 +0000 Received: from DM6NAM11FT031.eop-nam11.prod.protection.outlook.com (2603:10b6:5:80:cafe::30) by DM6PR08CA0027.outlook.office365.com (2603:10b6:5:80::40) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.17 via Frontend Transport; Mon, 24 Jan 2022 18:17:55 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 12.22.5.238) smtp.mailfrom=nvidia.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=nvidia.com; Received-SPF: Pass 
(protection.outlook.com: domain of nvidia.com designates 12.22.5.238 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.238; helo=mail.nvidia.com; Received: from mail.nvidia.com (12.22.5.238) by DM6NAM11FT031.mail.protection.outlook.com (10.13.172.203) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:54 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by DRHQMAIL105.nvidia.com (10.27.9.14) with Microsoft SMTP Server (TLS) id 15.0.1497.18; Mon, 24 Jan 2022 18:17:54 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by drhqmail201.nvidia.com (10.126.190.180) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.9; Mon, 24 Jan 2022 10:17:54 -0800 Received: from nvidia-Inspiron-15-7510.nvidia.com (10.127.8.13) by mail.nvidia.com (10.126.190.180) with Microsoft SMTP Server id 15.2.986.9 via Frontend Transport; Mon, 24 Jan 2022 10:17:50 -0800 From: Abhishek Sahu To: , Alex Williamson , Cornelia Huck CC: Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , , Abhishek Sahu Subject: [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Date: Mon, 24 Jan 2022 23:47:25 +0530 Message-ID: <20220124181726.19174-5-abhsahu@nvidia.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220124181726.19174-1-abhsahu@nvidia.com> References: <20220124181726.19174-1-abhsahu@nvidia.com> X-NVConfidentiality: public MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: c9f547c5-af1f-456b-46f9-08d9df65d78d X-MS-TrafficTypeDiagnostic: SA0PR12MB4381:EE_ X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:883; X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: 
0R6n4/tdOCQPsXn187VIvpBLZXia2kjzvlQagZ1s87u7nuAKKgPNBl14ZPnbHC/11fvxVe4MC+t3lA20C/PZed7agUElabfysYQL5DS7iKVFAT6BfnVfJ2xjeojlwIG2DOEBMyumi9CVCo71V/ius3xVGaSuXF5qlFgiw6lA+8Uh5AANaFj+q261yChhd5iEAS7j2wIMyvHc1Bhd/eWxRVtHpaEVMEeVVRWq2Ibmgw+7b4jg6aEJ6f35aOzVjSxQ0akr1+ZuvxWuZ6gMmE0tJTV/gZLT4CZuO1HK/yGXM8yyjSAQnnMTFBNq1vnKJs/2F/6U8AoslnuX+hCvL3ZyFc3xup3ge490hKVi5sS3tVt1VEacrU0O4qs1Y7YeX3QmdRQUyTpZHgo7vvvemrFcJ+FfNr6OJHkEKNUs0T/JjtIHgekXRv/sQCAq3guCSE3YuLBgg4Y+gYTkuSNXR/hnjH5E2MPAJU7PIGQdTqpXI9R68PxjoVnYH5p5gG+9P39T0IQlYIWXu7GhxkZUcpv8pWjcvRFSHby4/vN1zwrSigb6j5heFF5gn7ophLdUkeNJf5KNCZxqgtuJSkmkyt0+wvnalD2MQrODvQ5wzutqIZlmM4qL9lG0Cl8Q4r3z41RXo05IrSkDEK3dBmJdVjI1NQt+tSEpg4K7rODYUvgr6KgvF8zSVQ2gduPUGyz50RN8q4Uhw9o3nBbTJjdRPxIFBUJt1I+Duq42/GtfoZpqJfxFh+Ql3ZULvFHR8sMcfNactgIVxwOqRx78/9Hd1CjKGSGGBjBsC6Vgt7XZazWkiR2DyW4+TAD+7AcPgSAKdgSR9mPl8V3czLXCTpAB3UTQng== X-Forefront-Antispam-Report: CIP:12.22.5.238;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:mail.nvidia.com;PTR:InfoNoRecords;CAT:NONE;SFS:(4636009)(46966006)(40470700004)(36840700001)(40460700003)(426003)(336012)(186003)(36860700001)(508600001)(47076005)(4326008)(82310400004)(70206006)(2616005)(1076003)(70586007)(8936002)(110136005)(54906003)(7696005)(6666004)(356005)(83380400001)(5660300002)(81166007)(86362001)(26005)(36756003)(107886003)(2906002)(8676002)(316002)(32563001)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2022 18:17:54.9593 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: c9f547c5-af1f-456b-46f9-08d9df65d78d X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[12.22.5.238];Helo=[mail.nvidia.com] X-MS-Exchange-CrossTenant-AuthSource: DM6NAM11FT031.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem 
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SA0PR12MB4381 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

According to [PCIe v5 5.3.1.4.1] for D3hot state

 "Configuration and Message requests are the only TLPs accepted by a Function in the D3Hot state. All other received Requests must be handled as Unsupported Requests, and all received Completions may optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into the D3hot state and the user makes a non-config read/write request, the request is forwarded to the host, and this access may cause issues on a few systems. This patch leverages the memory-disable support added in commit 'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory")' to generate a page fault on mmap access and return an error for direct read/write. If the device is in the D3hot state, the error needs to be returned for all kinds of BAR-related access (memory, IO and ROM). Also, the power-related structure fields need to be protected, so the same 'memory_lock' is used to protect these fields as well. In a few cases this 'memory_lock' is already held by the caller, so introduce a separate function, vfio_pci_set_power_state_locked(). The original vfio_pci_set_power_state() now takes the required locks and then calls vfio_pci_set_power_state_locked().
Signed-off-by: Abhishek Sahu --- drivers/vfio/pci/vfio_pci_core.c | 47 +++++++++++++++++++++++++------- drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++---- 2 files changed, 51 insertions(+), 16 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_c= ore.c index ee2fb8af57fa..38440d48973f 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -201,11 +201,12 @@ static void vfio_pci_probe_power_state(struct vfio_pc= i_core_device *vdev) } =20 /* - * pci_set_power_state() wrapper handling devices which perform a soft res= et on - * D3->D0 transition. Save state prior to D0/1/2->D3, stash it on the vde= v, - * restore when returned to D0. Saved separately from pci_saved_state for= use - * by PM capability emulation and separately from pci_dev internal saved s= tate - * to avoid it being overwritten and consumed around other resets. + * vfio_pci_set_power_state_locked() wrapper handling devices which perfor= m a + * soft reset on D3->D0 transition. Save state prior to D0/1/2->D3, stash= it + * on the vdev, restore when returned to D0. Saved separately from + * pci_saved_state for use by PM capability emulation and separately from + * pci_dev internal saved state to avoid it being overwritten and consumed + * around other resets. * * There are few cases where the PCI power state can be changed to D0 * without the involvement of this API. So, cache the power state locally @@ -215,7 +216,8 @@ static void vfio_pci_probe_power_state(struct vfio_pci_= core_device *vdev) * The memory taken for saving this PCI state needs to be freed to * prevent memory leak. 
*/ -int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_= t state) +static int vfio_pci_set_power_state_locked(struct vfio_pci_core_device *vd= ev, + pci_power_t state) { struct pci_dev *pdev =3D vdev->pdev; bool needs_restore =3D false, needs_save =3D false; @@ -260,6 +262,26 @@ int vfio_pci_set_power_state(struct vfio_pci_core_devi= ce *vdev, pci_power_t stat return ret; } =20 +/* + * vfio_pci_set_power_state() takes all the required locks to protect + * the access of power related variables and then invokes + * vfio_pci_set_power_state_locked(). + */ +int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, + pci_power_t state) +{ + int ret; + + if (state >=3D PCI_D3hot) + vfio_pci_zap_and_down_write_memory_lock(vdev); + else + down_write(&vdev->memory_lock); + + ret =3D vfio_pci_set_power_state_locked(vdev, state); + up_write(&vdev->memory_lock); + return ret; +} + int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) { struct pci_dev *pdev =3D vdev->pdev; @@ -354,7 +376,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device = *vdev) * in running the logic needed for D0 power state. The subsequent * runtime PM API's will put the device into the low power state again. */ - vfio_pci_set_power_state(vdev, PCI_D0); + vfio_pci_set_power_state_locked(vdev, PCI_D0); =20 /* Stop the device from further DMA */ pci_clear_master(pdev); @@ -967,7 +989,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev,= unsigned int cmd, * interaction. Update the power state in vfio driver to perform * the logic needed for D0 power state. 
*/ - vfio_pci_set_power_state(vdev, PCI_D0); + vfio_pci_set_power_state_locked(vdev, PCI_D0); up_write(&vdev->memory_lock); =20 return ret; @@ -1453,6 +1475,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_faul= t *vmf) goto up_out; } =20 + if (vdev->power_state >=3D PCI_D3hot) { + ret =3D VM_FAULT_SIGBUS; + goto up_out; + } + /* * We populate the whole vma on fault, so we need to test whether * the vma has already been mapped, such as for concurrent faults @@ -1902,7 +1929,7 @@ int vfio_pci_core_register_device(struct vfio_pci_cor= e_device *vdev) * be able to get to D3. Therefore first do a D0 transition * before enabling runtime PM. */ - vfio_pci_set_power_state(vdev, PCI_D0); + vfio_pci_set_power_state_locked(vdev, PCI_D0); pm_runtime_allow(&pdev->dev); =20 if (!disable_idle_d3) @@ -2117,7 +2144,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_dev= ice_set *dev_set, * interaction. Update the power state in vfio driver to perform * the logic needed for D0 power state. */ - vfio_pci_set_power_state(cur, PCI_D0); + vfio_pci_set_power_state_locked(cur, PCI_D0); if (cur =3D=3D cur_mem) is_mem =3D false; if (cur =3D=3D cur_vma) diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_r= dwr.c index 57d3b2cbbd8e..e97ba14c4aa0 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -41,8 +41,13 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev, \ bool test_mem, u##size val, void __iomem *io) \ { \ + down_read(&vdev->memory_lock); \ + if (vdev->power_state >=3D PCI_D3hot) { \ + up_read(&vdev->memory_lock); \ + return -EIO; \ + } \ + \ if (test_mem) { \ - down_read(&vdev->memory_lock); \ if (!__vfio_pci_memory_enabled(vdev)) { \ up_read(&vdev->memory_lock); \ return -EIO; \ @@ -51,8 +56,7 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_de= vice *vdev, \ \ vfio_iowrite##size(val, io); \ \ - if (test_mem) \ - up_read(&vdev->memory_lock); \ + up_read(&vdev->memory_lock); \ \ return 0; 
\ } @@ -68,8 +72,13 @@ VFIO_IOWRITE(64) static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev, \ bool test_mem, u##size *val, void __iomem *io) \ { \ + down_read(&vdev->memory_lock); \ + if (vdev->power_state >=3D PCI_D3hot) { \ + up_read(&vdev->memory_lock); \ + return -EIO; \ + } \ + \ if (test_mem) { \ - down_read(&vdev->memory_lock); \ if (!__vfio_pci_memory_enabled(vdev)) { \ up_read(&vdev->memory_lock); \ return -EIO; \ @@ -78,8 +87,7 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_dev= ice *vdev, \ \ *val =3D vfio_ioread##size(io); \ \ - if (test_mem) \ - up_read(&vdev->memory_lock); \ + up_read(&vdev->memory_lock); \ \ return 0; \ } --=20 2.17.1 From nobody Tue Jun 30 05:25:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF7E1C4332F for ; Mon, 24 Jan 2022 18:18:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245216AbiAXSSM (ORCPT ); Mon, 24 Jan 2022 13:18:12 -0500 Received: from mail-bn8nam12on2041.outbound.protection.outlook.com ([40.107.237.41]:16224 "EHLO NAM12-BN8-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S245214AbiAXSSC (ORCPT ); Mon, 24 Jan 2022 13:18:02 -0500 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=dkLFO/XiXgtOJ7k9co6xl4lWHKtfTlrMwu1QTfSDAuJXWUkkz7Q9YXcQt51h+RQw7sjnDiHn3tBJLuCWfQy7JB7NjjzvFtub37GjkfOIr/75/c+mfMiQEvKt+e8htDa0gICXT9Q30NGufWmUN8DSPBarrjDMvUR7amWrubxC7jxvKDKnkaA/xWD/MLKefnddniszIiW+o7FDtlIF6p9ejl+oCwfzcww/xkU3nhvKpvZ/QsaSt7AR5hGv6wgeZxdNPyQCtOC8Sp/SxwZaPIQBXCzsJ6CkWPd5in2ActzbIfKx6p9JBfYykSdm+k4b4xTrlV3Q5gXjUi2gWp3qIrt6Tw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; 
h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=7kMmK440KY1hsngBEDOmuALB/c01GhuWzXRykEI4+88=; b=OkK5WwAHLc2EViVANl9FfsoeA+agsP+mZ+HqHVaDYMgKlnO334pBK7s4AnRaxji9srHbT7ByePbhEv5JgaHXgTKKyW3BCFg7psSgWyUj5lO5zZsOnc5qfqAXuhrxY9E/YX3QdTCM54jm4wABPkxqoZeBAk+Oj3IMDu0d53e2sEiRL8Uv/YEv1eTdQfXVvzvodN5W2EtdppVlVy5wnn9RhUqjapd+9B2Ya/2KX37EDwVrbhNtjVDdoK+0YquTOTqFaOi7LaUJ+8ecex6cHuPgtPqlgiEmSjkpyMWm3+85LpixJksOEpukC292BmR3DOzxO186freVhW2YIBOdOA02Wg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 12.22.5.238) smtp.rcpttodomain=huawei.com smtp.mailfrom=nvidia.com; dmarc=pass (p=reject sp=reject pct=100) action=none header.from=nvidia.com; dkim=none (message not signed); arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Nvidia.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=7kMmK440KY1hsngBEDOmuALB/c01GhuWzXRykEI4+88=; b=nDnyuxvsmX5sARaZsMor5DwhgpuTrJv8Qsayu0NmrcPnq90dIOp0/b0JIBsOgltSv2bcTQWwM0bArTLAhK9yOUyoVTY3A1nabXDbqm7jBWv0lTAEybWsjs5QPiC0rEFgmgMuBOWoDwGhXMY7uedri3buVqjVTP6Y65JT4cUPimaQxTa7s3ItdwHHxWZ7ZgxALrm7L235E5XZxCrOSJlvEYXkfD1zMOwqjythKXRusJAj1LbMSBdLfBzlPvNcFqn/R4Yd0GUTCvfUs39Z7Ilrx2HpJjtF4bP5uSvrzxrJTWpqD+wJY4lQO4wAWxhnu64Qe60xPNQU7Q/rL7rkMK4ecQ== Received: from DM5PR19CA0035.namprd19.prod.outlook.com (2603:10b6:3:9a::21) by BY5PR12MB3714.namprd12.prod.outlook.com (2603:10b6:a03:1a9::18) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.10; Mon, 24 Jan 2022 18:18:00 +0000 Received: from DM6NAM11FT029.eop-nam11.prod.protection.outlook.com (2603:10b6:3:9a:cafe::58) by DM5PR19CA0035.outlook.office365.com (2603:10b6:3:9a::21) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.4909.17 via Frontend Transport; Mon, 24 Jan 2022 18:17:59 +0000 
X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 12.22.5.238) smtp.mailfrom=nvidia.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=nvidia.com; Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.238 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.238; helo=mail.nvidia.com; Received: from mail.nvidia.com (12.22.5.238) by DM6NAM11FT029.mail.protection.outlook.com (10.13.173.23) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.20.4909.7 via Frontend Transport; Mon, 24 Jan 2022 18:17:59 +0000 Received: from drhqmail202.nvidia.com (10.126.190.181) by DRHQMAIL105.nvidia.com (10.27.9.14) with Microsoft SMTP Server (TLS) id 15.0.1497.18; Mon, 24 Jan 2022 18:17:58 +0000 Received: from drhqmail201.nvidia.com (10.126.190.180) by drhqmail202.nvidia.com (10.126.190.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.986.9; Mon, 24 Jan 2022 10:17:58 -0800 Received: from nvidia-Inspiron-15-7510.nvidia.com (10.127.8.13) by mail.nvidia.com (10.126.190.180) with Microsoft SMTP Server id 15.2.986.9 via Frontend Transport; Mon, 24 Jan 2022 10:17:54 -0800 From: Abhishek Sahu To: , Alex Williamson , Cornelia Huck CC: Max Gurtovoy , Yishai Hadas , Zhen Lei , Jason Gunthorpe , , Abhishek Sahu Subject: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Date: Mon, 24 Jan 2022 23:47:26 +0530 Message-ID: <20220124181726.19174-6-abhsahu@nvidia.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220124181726.19174-1-abhsahu@nvidia.com> References: <20220124181726.19174-1-abhsahu@nvidia.com> X-NVConfidentiality: public MIME-Version: 1.0 X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 95bf5bc7-0111-4245-456a-08d9df65da17 X-MS-TrafficTypeDiagnostic: BY5PR12MB3714:EE_ X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:9508; 
X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: RWaEARxNJt4PGCQX9UND3JnFkm80lqTfbpbOFpg3Lf+dt9q01Zlob2z+42wY7ECvuJNQPp+cEaz6pY5yKsjoKyeRpPISdAK3ycatH2Io3AgcHZ44TbPZfFx96LvDQ6FU59AFUPSONZecXCazSAIimDiCWICQsj5CWBQe8bPSwhqmSavhcjr5w7pUBXbhc1H26GkWUaOR0TRKF2fdW+0AlbLLGSi19UbwvMmet3tzJdAZvQ4wgB/u77VLgXRVI/qNeHPcs1hUfe1aOF0Kbzwow2b0Q1GEtmOFei92I3G5YB0i2eEFRcOnTCerfFuY0rPqSV2OAJ1ZAv4YVRhoMV2CDklhFLexDq58cxjwRxBLVwrt3qceQXpjTERlXn/bv4jwhCKw8ggXPXvVyFo92RrP4Rl/wgv/ZiRVNYWpJ6KBuEjbBzoLfBDQhY4X+/XKDTB59LbeMyqiNp4rbBlGQUCgp3dtYbcNK+Kw7EEjF2G304AWp2T2QgbQ+bxbZ53rXO24QV4EwoN/khG6rruRZ+vOsj9YqUOvfKGUl0OWPUnHGa3tN07lu2G5Ne+DnfINsIDO46bxtoySNDI7MIVwt0RAQ5IeCcirOiLjfWtvdNr+O3fRaMcc0Rp6m7X2ttgfkweBl8YPSA4oDyb6u2XZW9JbBRdmG72SqypwtovDTaW5J1P1Go3lkOk0dinSNPrSBaaK6Y06qsrMTMKLkAZQbwK/YZPHNJ2AG5oL1+tzqHZgTBbcUJdTiikrUmcHxXEsiazQF6F59BgIX+q1ya5z8enL0HZNM0dToebnhkzDGxc7D5dS1D6rbg4fQWHfVv3JCaeetDQNiX8hqBXvq/Wu3Hxoxw== X-Forefront-Antispam-Report: CIP:12.22.5.238;CTRY:US;LANG:en;SCL:1;SRV:;IPV:CAL;SFV:NSPM;H:mail.nvidia.com;PTR:InfoNoRecords;CAT:NONE;SFS:(4636009)(36840700001)(40470700004)(46966006)(2906002)(40460700003)(107886003)(508600001)(8676002)(426003)(30864003)(5660300002)(86362001)(4326008)(36860700001)(7696005)(26005)(81166007)(82310400004)(2616005)(356005)(83380400001)(6666004)(1076003)(336012)(36756003)(316002)(70206006)(70586007)(47076005)(8936002)(110136005)(186003)(54906003)(32563001)(36900700001);DIR:OUT;SFP:1101; X-OriginatorOrg: Nvidia.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jan 2022 18:17:59.1566 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 95bf5bc7-0111-4245-456a-08d9df65da17 X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[12.22.5.238];Helo=[mail.nvidia.com] X-MS-Exchange-CrossTenant-AuthSource: 
DM6NAM11FT029.eop-nam11.prod.protection.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: BY5PR12MB3714 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Currently, if the runtime power management is enabled for vfio-pci device in the guest OS, then guest OS will do the register write for PCI_PM_CTRL register. This write request will be handled in vfio_pm_config_write() where it will do the actual register write of PCI_PM_CTRL register. With this, the maximum D3hot state can be achieved for low power. If we can use the runtime PM framework, then we can achieve the D3cold state which will help in saving maximum power. 1. Since D3cold state can't be achieved by writing PCI standard PM config registers, so this patch adds a new IOCTL which change the PCI device from D3hot to D3cold state and then D3cold to D0 state. 2. The hypervisors can implement virtual ACPI methods. For example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power resources with _ON/_OFF method, then guest linux OS makes the _OFF call during D3cold transition and then _ON during D0 transition. The hypervisor can tap these virtual ACPI calls and then do the D3cold related IOCTL in the vfio driver. 3. The vfio driver uses runtime PM framework to achieve the D3cold state. For the D3cold transition, decrement the usage count and during D0 transition increment the usage count. 4. For D3cold, the device current power state should be D3hot. Then during runtime suspend, the pci_platform_power_transition() is required for D3cold state. If the D3cold state is not supported, then the device will still be in D3hot state. But with the runtime PM, the root port can now also go into suspended state. 5. For most of the systems, the D3cold is supported at the root port level. 
   So, when the root port transitions to the D3cold state, the vfio PCI
   device will go from D3hot to D3cold during its runtime suspend. If
   the root port does not support D3cold, then the root port will go
   into the D3hot state.

6. The runtime suspend callback can now happen in 2 cases: there is no
   user of the vfio device, or the user has initiated D3cold. The
   'runtime_suspend_pending' flag helps to distinguish these cases.

7. There are cases where the guest has put the PCI device into D3cold
   state and then, on the host side, the user runs lspci or any other
   command which requires access to the PCI config registers. In this
   case, the kernel runtime PM framework will resume the PCI device
   internally, read the config space and put the device into D3cold
   state again. Some PCI devices need SW involvement before going into
   D3cold state. For the first D3cold transition, the driver running
   in the guest does the SW side steps. But the second D3cold
   transition would happen without guest driver involvement. So,
   prevent this second D3cold transition by incrementing the device
   usage count. This will keep the device unnecessarily in D0, but
   that is better than failure. In the future, we can add some
   mechanism to forward these wake-up requests to the guest, and then
   the mentioned case can be handled as well.

8. In D3cold, all BAR related access needs to be disabled, as in
   D3hot. Additionally, the config space will also be disabled in
   D3cold state. To prevent access to the config space in D3cold
   state, increment the runtime PM usage count before doing any config
   space access. Also, most of the IOCTLs access the config space, so
   maintain a safe list and skip the resume only for these safe
   IOCTLs. For other IOCTLs, the runtime PM usage count will be
   incremented first.

9. Now, the runtime suspend/resume callbacks need to get the vdev
   reference, which can be obtained with dev_get_drvdata(). Currently,
   dev_set_drvdata() is called after returning from
   vfio_pci_core_register_device().
   The runtime callbacks can come any time after enabling runtime PM,
   so dev_set_drvdata() must happen before that. We can move
   dev_set_drvdata() inside vfio_pci_core_register_device() itself.

10. The vfio device user can close the device after putting the device
    into the runtime suspended state, so increment the runtime PM usage
    count inside vfio_pci_core_disable().

11. Runtime PM is possible only if CONFIG_PM is enabled on the host.
    So, the IOCTL related code can be put under the CONFIG_PM Kconfig.

Signed-off-by: Abhishek Sahu
---
 drivers/vfio/pci/vfio_pci.c        |   1 -
 drivers/vfio/pci/vfio_pci_config.c |  11 +-
 drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  21 ++++
 5 files changed, 211 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index c8695baf3b54..4ac3338c8fc7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
 		goto out_free;
-	dev_set_drvdata(&pdev->dev, vdev);
 	return 0;
 
 out_free:
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index dd9ed211ba6f..d20420657959 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
+	struct device *dev = &vdev->pdev->dev;
 	size_t done = 0;
 	int ret = 0;
 	loff_t pos = *ppos;
 
 	pos &= VFIO_PCI_OFFSET_MASK;
 
+	ret = pm_runtime_resume_and_get(dev);
+	if (ret < 0)
+		return ret;
+
 	while (count) {
 		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
-		if (ret < 0)
+		if (ret < 0) {
+			pm_runtime_put(dev);
 			return ret;
+		}
 
 		count -= ret;
 		done += ret;
@@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 		pos += ret;
 	}
 
+	pm_runtime_put(dev);
 	*ppos += done;
 
 	return done;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 38440d48973f..b70bb4fd940d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
 	/*
-	 * If disable has been called while the power state is other than D0,
-	 * then set the power state in vfio driver to D0. It will help
-	 * in running the logic needed for D0 power state. The subsequent
-	 * runtime PM API's will put the device into the low power state again.
+	 * The vfio device user can close the device after putting the device
+	 * into runtime suspended state so wake up the device first in
+	 * this case.
 	 */
-	vfio_pci_set_power_state_locked(vdev, PCI_D0);
+	if (vdev->runtime_suspend_pending) {
+		vdev->runtime_suspend_pending = false;
+		pm_runtime_resume_and_get(&pdev->dev);
+	} else {
+		/*
+		 * If disable has been called while the power state is other
+		 * than D0, then set the power state in vfio driver to D0. It
+		 * will help in running the logic needed for D0 power state.
+		 * The subsequent runtime PM API's will put the device into
+		 * the low power state again.
+		 */
+		vfio_pci_set_power_state_locked(vdev, PCI_D0);
+	}
 
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
@@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
-long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
-		unsigned long arg)
+static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
+					 unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_core_device *vdev =
 		container_of(core_vdev, struct vfio_pci_core_device, vdev);
@@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		default:
 			return -ENOTTY;
 		}
+#ifdef CONFIG_PM
+	} else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {
+		struct vfio_power_management vfio_pm;
+		struct pci_dev *pdev = vdev->pdev;
+		bool request_idle = false, request_resume = false;
+		int ret = 0;
+
+		if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
+			return -EFAULT;
+
+		/*
+		 * The vdev power related fields are protected with memory_lock
+		 * semaphore.
+		 */
+		down_write(&vdev->memory_lock);
+		switch (vfio_pm.d3cold_state) {
+		case VFIO_DEVICE_D3COLD_STATE_ENTER:
+			/*
+			 * For D3cold, the device should already be in D3hot
+			 * state.
+			 */
+			if (vdev->power_state < PCI_D3hot) {
+				ret = -EINVAL;
+				break;
+			}
+
+			if (!vdev->runtime_suspend_pending) {
+				vdev->runtime_suspend_pending = true;
+				pm_runtime_put_noidle(&pdev->dev);
+				request_idle = true;
+			}
+
+			break;
+
+		case VFIO_DEVICE_D3COLD_STATE_EXIT:
+			/*
+			 * If the runtime resume has already been run, then
+			 * the device will be already in D0 state.
+			 */
+			if (vdev->runtime_suspend_pending) {
+				vdev->runtime_suspend_pending = false;
+				pm_runtime_get_noresume(&pdev->dev);
+				request_resume = true;
+			}
+
+			break;
+
+		default:
+			ret = -EINVAL;
+			break;
+		}
+
+		up_write(&vdev->memory_lock);
+
+		/*
+		 * Call the runtime PM API's without any lock. Inside vfio driver
+		 * runtime suspend/resume, the locks can be acquired again.
+		 */
+		if (request_idle)
+			pm_request_idle(&pdev->dev);
+
+		if (request_resume)
+			pm_runtime_resume(&pdev->dev);
+
+		return ret;
+#endif
 	}
 
 	return -ENOTTY;
 }
+
+long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+			 unsigned long arg)
+{
+#ifdef CONFIG_PM
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct device *dev = &vdev->pdev->dev;
+	bool skip_runtime_resume = false;
+	long ret;
+
+	/*
+	 * The list of commands which are safe to execute when the PCI device
+	 * is in D3cold state. In D3cold state, the PCI config or any other IO
+	 * access won't work.
+	 */
+	switch (cmd) {
+	case VFIO_DEVICE_POWER_MANAGEMENT:
+	case VFIO_DEVICE_GET_INFO:
+	case VFIO_DEVICE_FEATURE:
+		skip_runtime_resume = true;
+		break;
+
+	default:
+		break;
+	}
+
+	if (!skip_runtime_resume) {
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
+
+	if (!skip_runtime_resume)
+		pm_runtime_put(dev);
+
+	return ret;
+#else
+	return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
+#endif
+}
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
 static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
@@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		return -EBUSY;
 	}
 
+	dev_set_drvdata(&pdev->dev, vdev);
 	if (pci_is_root_bus(pdev->bus)) {
 		ret = vfio_assign_device_set(&vdev->vdev, vdev);
 	} else if (!pci_probe_reset_slot(pdev->slot)) {
@@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 		pm_runtime_get_noresume(&pdev->dev);
 
 	pm_runtime_forbid(&pdev->dev);
+	dev_set_drvdata(&pdev->dev, NULL);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
@@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 #ifdef CONFIG_PM
 static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	down_read(&vdev->memory_lock);
+
+	/*
+	 * runtime_suspend_pending won't be set if there is no user of vfio pci
+	 * device. In that case, return early and PCI core will take care of
+	 * putting the device in the low power state.
+	 */
+	if (!vdev->runtime_suspend_pending) {
+		up_read(&vdev->memory_lock);
+		return 0;
+	}
+
+	/*
+	 * The runtime suspend will be called only if device is already at
+	 * D3hot state. Now, change the device state from D3hot to D3cold by
+	 * using platform power management. If setting of D3cold is not
+	 * supported for the PCI device, then the device state will still be
+	 * in D3hot state. The PCI core expects to save the PCI state, if
+	 * driver runtime routine handles the power state management.
+	 */
+	pci_save_state(pdev);
+	pci_platform_power_transition(pdev, PCI_D3cold);
+	up_read(&vdev->memory_lock);
+
 	return 0;
 }
 
 static int vfio_pci_core_runtime_resume(struct device *dev)
 {
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The PCI core will move the device to D0 state before calling the
+	 * driver runtime resume.
+	 */
+	vfio_pci_set_power_state_locked(vdev, PCI_D0);
+
+	/*
+	 * Some PCI devices need SW involvement before going to D3cold
+	 * state again. So if there is any wake-up which is not triggered
+	 * by the guest, then increase the usage count to prevent the
+	 * second runtime suspend.
+	 */
+	if (vdev->runtime_suspend_pending) {
+		vdev->runtime_suspend_pending = false;
+		pm_runtime_get_noresume(&pdev->dev);
+	}
+
+	up_write(&vdev->memory_lock);
 	return 0;
 }
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 05db838e72cc..8bbfd028115a 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			runtime_suspend_pending;
 	pci_power_t		power_state;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..7b7dadc6df71 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1002,6 +1002,27 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/**
+ * VFIO_DEVICE_POWER_MANAGEMENT - _IOW(VFIO_TYPE, VFIO_BASE + 18,
+ *			       struct vfio_power_management)
+ *
+ * Provide the support for device power management.  The native PCI power
+ * management does not support the D3cold power state.  For moving the device
+ * into D3cold state, change the PCI state to D3hot with standard
+ * configuration registers and then call this IOCTL to set the D3cold
+ * state.  Similarly, if the device is in D3cold state, then call this IOCTL
+ * to exit from D3cold state.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+#define VFIO_DEVICE_POWER_MANAGEMENT	_IO(VFIO_TYPE, VFIO_BASE + 18)
+struct vfio_power_management {
+	__u32	argsz;
+#define VFIO_DEVICE_D3COLD_STATE_EXIT	0x0
+#define VFIO_DEVICE_D3COLD_STATE_ENTER	0x1
+	__u32	d3cold_state;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1
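
The usage-count handling described in points 3, 6 and 7 of the commit
message can be sketched as a small user-space model. This is only an
illustration under simplifying assumptions, not the driver code: struct
model_dev and all model_*() functions are hypothetical names that do not
exist in the kernel, the runtime PM core is reduced to a single counter,
the host-side accessor is assumed to take and drop its own usage
reference, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical power states for the model (not the kernel's pci_power_t). */
enum model_state { MODEL_D0, MODEL_D3HOT, MODEL_D3COLD };

/* Hypothetical stand-in for the per-device state the patch touches. */
struct model_dev {
	enum model_state state;
	bool runtime_suspend_pending;	/* same name as the flag in the patch */
	int usage_count;		/* stands in for the runtime PM usage count */
};

/* A device opened by a user: D0, with one usage reference held. */
void model_open(struct model_dev *d)
{
	d->state = MODEL_D0;
	d->runtime_suspend_pending = false;
	d->usage_count = 1;
}

/* Guest write to PCI_PM_CTRL: only D3hot is reachable this way. */
void model_guest_set_d3hot(struct model_dev *d)
{
	d->state = MODEL_D3HOT;
}

/* Models the D3cold ENTER request. Returns 0, or -1 in place of -EINVAL. */
int model_enter_d3cold(struct model_dev *d)
{
	if (d->state == MODEL_D0)
		return -1;			/* device must already be in D3hot */
	if (!d->runtime_suspend_pending) {
		d->runtime_suspend_pending = true;
		d->usage_count--;		/* like pm_runtime_put_noidle() */
		if (d->usage_count == 0)	/* runtime suspend may now run */
			d->state = MODEL_D3COLD;
	}
	return 0;
}

/* Models the D3cold EXIT request. */
void model_exit_d3cold(struct model_dev *d)
{
	if (d->runtime_suspend_pending) {
		d->runtime_suspend_pending = false;
		d->usage_count++;		/* like pm_runtime_get_noresume() */
		d->state = MODEL_D0;		/* runtime resume brings back D0 */
	}
}

/* Models a host-side wake-up (e.g. a config read) not triggered by the guest. */
void model_host_wakeup(struct model_dev *d)
{
	if (d->state != MODEL_D3COLD)
		return;
	d->usage_count++;			/* accessor takes a reference */
	d->state = MODEL_D0;			/* runtime resume callback runs */
	if (d->runtime_suspend_pending) {
		d->runtime_suspend_pending = false;
		d->usage_count++;		/* extra reference blocks a second suspend */
	}
	d->usage_count--;			/* accessor drops its reference */
}

/* Walks through the sequence from point 7 of the commit message. */
void model_selftest(void)
{
	struct model_dev d;

	model_open(&d);
	assert(model_enter_d3cold(&d) == -1);	/* rejected while in D0 */
	model_guest_set_d3hot(&d);
	assert(model_enter_d3cold(&d) == 0);
	assert(d.state == MODEL_D3COLD && d.usage_count == 0);
	model_host_wakeup(&d);
	assert(d.state == MODEL_D0 && d.usage_count == 1);
	assert(!d.runtime_suspend_pending);
	model_exit_d3cold(&d);			/* no-op: already resumed */
	assert(d.state == MODEL_D0 && d.usage_count == 1);
}
```

Calling model_selftest() from a main() exercises the case where a host
wake-up leaves the device in D0 with the usage count raised, so a later
EXIT request from the guest does not increment the count a second time.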