From: Lei Rao <lei.rao@intel.com>
To: kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, hch@lst.de,
    sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com,
    jgg@ziepe.ca, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com,
    kevin.tian@intel.com, mjrosato@linux.ibm.com,
    linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org,
    kvm@vger.kernel.org
Cc: eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com,
    Konrad.wilk@oracle.com, stephen@eideticom.com, hang.yuan@intel.com
Subject: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver
Date: Tue, 6 Dec 2022 13:58:12 +0800
Message-Id: <20221206055816.292304-2-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

The new function nvme_submit_vf_cmd() helps the host VF driver issue
admin commands on behalf of a VF. It is useful when the host NVMe
driver does not control the VF's admin queue. For example, in the
device pass-through case, the VF controller's admin queue is governed
by the guest NVMe driver, so the host VF driver must rely on the PF
device's admin queue to control the VF, e.g. to issue vendor-specific
live migration commands.
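For context, the calling pattern the VF driver uses later in the series
(patch 3/5) looks like this; the command structure and opcode are the
vendor-specific definitions added there, shown here only as a sketch:

	struct nvme_live_mig_command c = { };
	int ret;

	c.suspend.opcode = nvme_admin_live_mig_suspend;	/* vendor-specific, patch 3/5 */
	c.suspend.vf_index = nvmevf_dev->vf_id;

	/* Routed to the PF's admin queue, not the guest-owned VF admin queue. */
	ret = nvme_submit_vf_cmd(vf_pdev, (struct nvme_command *)&c, NULL, NULL, 0);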
Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/nvme/host/pci.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 488ad7dabeb8..3d9c54d8e7fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3585,6 +3585,24 @@ static struct pci_driver nvme_driver = {
 	.err_handler	= &nvme_err_handler,
 };
 
+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+		       size_t *result, void *buffer, unsigned int bufflen)
+{
+	struct nvme_dev *ndev = NULL;
+	union nvme_result res = { };
+	int ret;
+
+	ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+	if (IS_ERR(ndev))
+		return PTR_ERR(ndev);
+	ret = __nvme_submit_sync_cmd(ndev->ctrl.admin_q, cmd, &res, buffer,
+				     bufflen, NVME_QID_ANY, 0, 0);
+	if (ret >= 0 && result)
+		*result = le32_to_cpu(res.u32);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_submit_vf_cmd);
+
 static int __init nvme_init(void)
 {
 	BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64);
--
2.34.1
From: Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 2/5] nvme-vfio: add new vfio-pci driver for NVMe device
Date: Tue, 6 Dec 2022 13:58:13 +0800
Message-Id: <20221206055816.292304-3-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

The NVMe device has a device-specific live migration implementation.
Add a dedicated VFIO PCI driver for NVMe devices; its live migration
support is added in the subsequent patches.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/Kconfig       |  2 +
 drivers/vfio/pci/Makefile      |  2 +
 drivers/vfio/pci/nvme/Kconfig  |  9 ++++
 drivers/vfio/pci/nvme/Makefile |  3 ++
 drivers/vfio/pci/nvme/nvme.c   | 99 ++++++++++++++++++++++++++++++++++
 5 files changed, 115 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/Kconfig
 create mode 100644 drivers/vfio/pci/nvme/Makefile
 create mode 100644 drivers/vfio/pci/nvme/nvme.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..fcd45144d3e3 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/nvme/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..eddc8e889726 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme/
diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
new file mode 100644
index 000000000000..c281fe154007
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVME_VFIO_PCI
+	tristate "VFIO support for NVMe PCI devices"
+	depends on VFIO_PCI_CORE
+	help
+	  This provides generic VFIO PCI support for NVMe devices
+	  using the VFIO framework.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/Makefile b/drivers/vfio/pci/nvme/Makefile
new file mode 100644
index 000000000000..2f4a0ad3d9cf
--- /dev/null
+++ b/drivers/vfio/pci/nvme/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVME_VFIO_PCI) += nvme-vfio-pci.o
+nvme-vfio-pci-y := nvme.o
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
new file mode 100644
index 000000000000..f1386d8a9287
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
All rights reserved + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int nvmevf_pci_open_device(struct vfio_device *core_vdev) +{ + struct vfio_pci_core_device *vdev =3D + container_of(core_vdev, struct vfio_pci_core_device, vdev); + int ret; + + ret =3D vfio_pci_core_enable(vdev); + if (ret) + return ret; + + vfio_pci_core_finish_enable(vdev); + return 0; +} + +static const struct vfio_device_ops nvmevf_pci_ops =3D { + .name =3D "nvme-vfio-pci", + .init =3D vfio_pci_core_init_dev, + .release =3D vfio_pci_core_release_dev, + .open_device =3D nvmevf_pci_open_device, + .close_device =3D vfio_pci_core_close_device, + .ioctl =3D vfio_pci_core_ioctl, + .device_feature =3D vfio_pci_core_ioctl_feature, + .read =3D vfio_pci_core_read, + .write =3D vfio_pci_core_write, + .mmap =3D vfio_pci_core_mmap, + .request =3D vfio_pci_core_request, + .match =3D vfio_pci_core_match, +}; + +static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_= id *id) +{ + struct vfio_pci_core_device *vdev; + int ret; + + vdev =3D vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev, + &nvmevf_pci_ops); + if (IS_ERR(vdev)) + return PTR_ERR(vdev); + + dev_set_drvdata(&pdev->dev, vdev); + ret =3D vfio_pci_core_register_device(vdev); + if (ret) + goto out_put_dev; + + return 0; + +out_put_dev: + vfio_put_device(&vdev->vdev); + return ret; +} + +static void nvmevf_pci_remove(struct pci_dev *pdev) +{ + struct vfio_pci_core_device *vdev =3D dev_get_drvdata(&pdev->dev); + + vfio_pci_core_unregister_device(vdev); + vfio_put_device(&vdev->vdev); +} + +static const struct pci_device_id nvmevf_pci_table[] =3D { + /* Intel IPU NVMe Virtual Function */ + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x1457) }, + {} +}; + +MODULE_DEVICE_TABLE(pci, nvmevf_pci_table); + +static struct pci_driver nvmevf_pci_driver =3D { + .name =3D KBUILD_MODNAME, + .id_table =3D nvmevf_pci_table, + .probe =3D nvmevf_pci_probe, + .remove =3D nvmevf_pci_remove, + .err_handler =3D &vfio_pci_core_err_handlers, + .driver_managed_dma =3D true, +}; + +module_pci_driver(nvmevf_pci_driver); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Lei Rao "); +MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe"); --=20 2.34.1 From nobody Thu Sep 18 12:57:19 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B9675C47090 for ; Tue, 6 Dec 2022 05:59:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233693AbiLFF7F (ORCPT ); Tue, 6 Dec 2022 00:59:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43246 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233637AbiLFF6w (ORCPT ); Tue, 6 Dec 2022 00:58:52 -0500 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F3B7C2613A; Mon, 5 Dec 2022 21:58:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670306328; x=1701842328; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=DejVWff0PrM8RwxO5Ftjo/53EiZqKWB8xQb046cdS+4=; b=YPrMLn/uYwYrYXr1PdwkLyNdP98ly6xkrdLEAnzF3oPkSUQEQp4klB93 QYDQjo5otb49GClbvSdPuPiHWjnC9FEJfCtQEuevvL4QVA2umlMljuQ5w 
From: Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 3/5] nvme-vfio: enable VFIO live migration
Date: Tue, 6 Dec 2022 13:58:14 +0800
Message-Id: <20221206055816.292304-4-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

Implement the device-specific VFIO live migration operations for NVMe
devices.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/Kconfig |   5 +-
 drivers/vfio/pci/nvme/nvme.c  | 543 ++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/nvme/nvme.h  | 111 +++++++
 3 files changed, 637 insertions(+), 22 deletions(-)
 create mode 100644 drivers/vfio/pci/nvme/nvme.h

diff --git a/drivers/vfio/pci/nvme/Kconfig b/drivers/vfio/pci/nvme/Kconfig
index c281fe154007..12e0eaba0de1 100644
--- a/drivers/vfio/pci/nvme/Kconfig
+++ b/drivers/vfio/pci/nvme/Kconfig
@@ -1,9 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NVME_VFIO_PCI
 	tristate "VFIO support for NVMe PCI devices"
+	depends on NVME_CORE
 	depends on VFIO_PCI_CORE
 	help
-	  This provides generic VFIO PCI support for NVMe devices
-	  using the VFIO framework.
+	  This provides migration support for NVMe devices using the
+	  VFIO framework.
 
 	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
index f1386d8a9287..698e470a4e53 100644
--- a/drivers/vfio/pci/nvme/nvme.c
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -13,29 +13,503 @@
 #include
 #include
 #include
-#include
-#include
+
+#include "nvme.h"
+
+#define MAX_MIGRATION_SIZE (256 * 1024)
+
+static int nvmevf_cmd_suspend_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.suspend.opcode = nvme_admin_live_mig_suspend;
+	c.suspend.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Suspend virtual function failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_resume_device(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.resume.opcode = nvme_admin_live_mig_resume;
+	c.resume.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Resume virtual function failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_query_data_size(struct nvmevf_pci_core_device *nvmevf_dev,
+				      size_t *state_size)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	size_t result;
+	int ret;
+
+	c.query.opcode = nvme_admin_live_mig_query_data_size;
+	c.query.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, &result, NULL, 0);
+	if (ret) {
+		dev_warn(&dev->dev, "Query the states size failed (ret=0x%x)\n", ret);
+		*state_size = 0;
+		return ret;
+	}
+	*state_size = result;
+	return 0;
+}
+
+static int nvmevf_cmd_save_data(struct nvmevf_pci_core_device *nvmevf_dev,
+				void *buffer, size_t buffer_len)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.save.opcode = nvme_admin_live_mig_save_data;
+	c.save.vf_index = nvmevf_dev->vf_id;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL, buffer, buffer_len);
+	if (ret) {
+		dev_warn(&dev->dev, "Save the device states failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int nvmevf_cmd_load_data(struct nvmevf_pci_core_device *nvmevf_dev,
+				struct nvmevf_migration_file *migf)
+{
+	struct pci_dev *dev = nvmevf_dev->core_device.pdev;
+	struct nvme_live_mig_command c = { };
+	int ret;
+
+	c.load.opcode = nvme_admin_live_mig_load_data;
+	c.load.vf_index = nvmevf_dev->vf_id;
+	c.load.size = migf->total_length;
+
+	ret = nvme_submit_vf_cmd(dev, (struct nvme_command *)&c, NULL,
+				 migf->vf_data, migf->total_length);
+	if (ret) {
+		dev_warn(&dev->dev, "Load the device states failed (ret=0x%x)\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static struct nvmevf_pci_core_device *nvmevf_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvmevf_pci_core_device, core_device);
+}
+
+static void nvmevf_disable_fd(struct nvmevf_migration_file *migf)
+{
+	mutex_lock(&migf->lock);
+
+	/* release the device states buffer */
+	kvfree(migf->vf_data);
+	migf->vf_data = NULL;
+	migf->disabled = true;
+	migf->total_length = 0;
+	migf->filp->f_pos = 0;
+
+	mutex_unlock(&migf->lock);
+}
+
+static int nvmevf_release_file(struct inode *inode, struct file *filp)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+
+	nvmevf_disable_fd(migf);
+	mutex_destroy(&migf->lock);
+	kfree(migf);
+	return 0;
+}
+
+static ssize_t nvmevf_save_read(struct file *filp, char __user *buf, size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	mutex_lock(&migf->lock);
+	if (*pos > migf->total_length) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (migf->disabled) {
+		done = -EINVAL;
+		goto out_unlock;
+	}
+
+	len = min_t(size_t, migf->total_length - *pos, len);
+	if (len) {
+		ret = copy_to_user(buf, migf->vf_data + *pos, len);
+		if (ret) {
+			done = -EFAULT;
+			goto out_unlock;
+		}
+		*pos += len;
+		done = len;
+	}
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_save_fops = {
+	.owner = THIS_MODULE,
+	.read = nvmevf_save_read,
+	.release = nvmevf_release_file,
+	.llseek = no_llseek,
+};
+
+static ssize_t nvmevf_resume_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *pos)
+{
+	struct nvmevf_migration_file *migf = filp->private_data;
+	loff_t requested_length;
+	ssize_t done = 0;
+	int ret;
+
+	if (pos)
+		return -ESPIPE;
+	pos = &filp->f_pos;
+
+	if (*pos < 0 || check_add_overflow((loff_t)len, *pos, &requested_length))
+		return -EINVAL;
+
+	if (requested_length > MAX_MIGRATION_SIZE)
+		return -ENOMEM;
+	mutex_lock(&migf->lock);
+	if (migf->disabled) {
+		done = -ENODEV;
+		goto out_unlock;
+	}
+
+	ret = copy_from_user(migf->vf_data + *pos, buf, len);
+	if (ret) {
+		done = -EFAULT;
+		goto out_unlock;
+	}
+	*pos += len;
+	done = len;
+	migf->total_length += len;
+
+out_unlock:
+	mutex_unlock(&migf->lock);
+	return done;
+}
+
+static const struct file_operations nvmevf_resume_fops = {
+	.owner = THIS_MODULE,
+	.write = nvmevf_resume_write,
+	.release = nvmevf_release_file,
+	.llseek = no_llseek,
+};
+
+static void nvmevf_disable_fds(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	if (nvmevf_dev->resuming_migf) {
+		nvmevf_disable_fd(nvmevf_dev->resuming_migf);
+		fput(nvmevf_dev->resuming_migf->filp);
+		nvmevf_dev->resuming_migf = NULL;
+	}
+
+	if (nvmevf_dev->saving_migf) {
+		nvmevf_disable_fd(nvmevf_dev->saving_migf);
+		fput(nvmevf_dev->saving_migf->filp);
+		nvmevf_dev->saving_migf = NULL;
+	}
+}
+
+static struct nvmevf_migration_file *
+nvmevf_pci_resume_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_resume_fops, migf,
+					O_WRONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	/* Allocate the buffer to load the device state; at most 256KB */
+	migf->vf_data = kvzalloc(MAX_MIGRATION_SIZE, GFP_KERNEL);
+	if (!migf->vf_data) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	return migf;
+
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static struct nvmevf_migration_file *
+nvmevf_pci_save_device_data(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	struct nvmevf_migration_file *migf;
+	int ret;
+
+	migf = kzalloc(sizeof(*migf), GFP_KERNEL);
+	if (!migf)
+		return ERR_PTR(-ENOMEM);
+
+	migf->filp = anon_inode_getfile("nvmevf_mig", &nvmevf_save_fops, migf,
+					O_RDONLY);
+	if (IS_ERR(migf->filp)) {
+		int err = PTR_ERR(migf->filp);
+
+		kfree(migf);
+		return ERR_PTR(err);
+	}
+
+	stream_open(migf->filp->f_inode, migf->filp);
+	mutex_init(&migf->lock);
+
+	ret = nvmevf_cmd_query_data_size(nvmevf_dev, &migf->total_length);
+	if (ret)
+		goto out_free;
+	/* Allocate the buffer and save the device state */
+	migf->vf_data = kvzalloc(migf->total_length, GFP_KERNEL);
+	if (!migf->vf_data) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	ret = nvmevf_cmd_save_data(nvmevf_dev, migf->vf_data, migf->total_length);
+	if (ret)
+		goto out_free;
+
+	return migf;
+out_free:
+	fput(migf->filp);
+	return ERR_PTR(ret);
+}
+
+static struct file *
+nvmevf_pci_step_device_state_locked(struct nvmevf_pci_core_device *nvmevf_dev, u32 new)
+{
+	u32 cur = nvmevf_dev->mig_state;
+	int ret;
+
+	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_suspend_device(nvmevf_dev);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_save_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->saving_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP) {
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) {
+		struct nvmevf_migration_file *migf;
+
+		migf = nvmevf_pci_resume_device_data(nvmevf_dev);
+		if (IS_ERR(migf))
+			return ERR_CAST(migf);
+		get_file(migf->filp);
+		nvmevf_dev->resuming_migf = migf;
+		return migf->filp;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
+		ret = nvmevf_cmd_load_data(nvmevf_dev, nvmevf_dev->resuming_migf);
+		if (ret)
+			return ERR_PTR(ret);
+		nvmevf_disable_fds(nvmevf_dev);
+		return NULL;
+	}
+
+	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING) {
+		nvmevf_cmd_resume_device(nvmevf_dev);
+		return NULL;
+	}
+
+	/* vfio_mig_get_next_state() does not use arcs other than the above */
+	WARN_ON(true);
+	return ERR_PTR(-EINVAL);
+}
+
+static void nvmevf_state_mutex_unlock(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+again:
+	spin_lock(&nvmevf_dev->reset_lock);
+	if (nvmevf_dev->deferred_reset) {
+		nvmevf_dev->deferred_reset = false;
+		spin_unlock(&nvmevf_dev->reset_lock);
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+		nvmevf_disable_fds(nvmevf_dev);
+		goto again;
+	}
+	mutex_unlock(&nvmevf_dev->state_mutex);
+	spin_unlock(&nvmevf_dev->reset_lock);
+}
+
+static struct file *
+nvmevf_pci_set_device_state(struct vfio_device *vdev, enum vfio_device_mig_state new_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(vdev,
+			struct nvmevf_pci_core_device, core_device.vdev);
+	enum vfio_device_mig_state next_state;
+	struct file *res = NULL;
+	int ret;
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	while (new_state != nvmevf_dev->mig_state) {
+		ret = vfio_mig_get_next_state(vdev, nvmevf_dev->mig_state, new_state, &next_state);
+		if (ret) {
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+
+		res = nvmevf_pci_step_device_state_locked(nvmevf_dev, next_state);
+		if (IS_ERR(res))
+			break;
+		nvmevf_dev->mig_state = next_state;
+		if (WARN_ON(res && new_state != nvmevf_dev->mig_state)) {
+			fput(res);
+			res = ERR_PTR(-EINVAL);
+			break;
+		}
+	}
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return res;
+}
+
+static int nvmevf_pci_get_device_state(struct vfio_device *vdev,
+				       enum vfio_device_mig_state *curr_state)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			vdev, struct nvmevf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	*curr_state = nvmevf_dev->mig_state;
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+	return 0;
+}
 
 static int nvmevf_pci_open_device(struct vfio_device *core_vdev)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
+	struct vfio_pci_core_device *vdev = &nvmevf_dev->core_device;
 	int ret;
 
 	ret = vfio_pci_core_enable(vdev);
 	if (ret)
 		return ret;
 
+	if (nvmevf_dev->migrate_cap)
+		nvmevf_dev->mig_state = VFIO_DEVICE_STATE_RUNNING;
 	vfio_pci_core_finish_enable(vdev);
 	return 0;
 }
 
+static void nvmevf_cmd_close_migratable(struct nvmevf_pci_core_device *nvmevf_dev)
+{
+	if (!nvmevf_dev->migrate_cap)
+		return;
+
+	mutex_lock(&nvmevf_dev->state_mutex);
+	nvmevf_disable_fds(nvmevf_dev);
+	nvmevf_state_mutex_unlock(nvmevf_dev);
+}
+
+static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(
+			core_vdev, struct nvmevf_pci_core_device, core_device.vdev);
+
+	nvmevf_cmd_close_migratable(nvmevf_dev);
+	vfio_pci_core_close_device(core_vdev);
+}
+
+static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
+	.migration_set_state = nvmevf_pci_set_device_state,
+	.migration_get_state = nvmevf_pci_get_device_state,
+};
+
+static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = container_of(core_vdev,
+			struct nvmevf_pci_core_device, core_device.vdev);
+	struct pci_dev *pdev = to_pci_dev(core_vdev->dev);
+	int vf_id;
+	int ret = -1;
+
+	if (!pdev->is_virtfn)
+		return ret;
+
+	nvmevf_dev->migrate_cap = 1;
+
+	vf_id = pci_iov_vf_id(pdev);
+	if (vf_id < 0)
+		return ret;
+	nvmevf_dev->vf_id = vf_id + 1;
+	core_vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
+
+	mutex_init(&nvmevf_dev->state_mutex);
+	spin_lock_init(&nvmevf_dev->reset_lock);
+	core_vdev->mig_ops = &nvmevf_pci_mig_ops;
+
+	return vfio_pci_core_init_dev(core_vdev);
+}
+
 static const struct vfio_device_ops nvmevf_pci_ops = {
 	.name = "nvme-vfio-pci",
-	.init = vfio_pci_core_init_dev,
+	.init = nvmevf_migration_init_dev,
 	.release = vfio_pci_core_release_dev,
 	.open_device = nvmevf_pci_open_device,
-	.close_device = vfio_pci_core_close_device,
+	.close_device = nvmevf_pci_close_device,
 	.ioctl = vfio_pci_core_ioctl,
 	.device_feature = vfio_pci_core_ioctl_feature,
 	.read = vfio_pci_core_read,
@@ -47,32 +521,56 @@ static const struct vfio_device_ops nvmevf_pci_ops = {
 
 static int nvmevf_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	struct vfio_pci_core_device *vdev;
+	struct nvmevf_pci_core_device *nvmevf_dev;
 	int ret;
 
-	vdev = vfio_alloc_device(vfio_pci_core_device, vdev, &pdev->dev,
-				 &nvmevf_pci_ops);
-	if (IS_ERR(vdev))
-		return PTR_ERR(vdev);
+	nvmevf_dev = vfio_alloc_device(nvmevf_pci_core_device, core_device.vdev,
+				       &pdev->dev, &nvmevf_pci_ops);
+	if (IS_ERR(nvmevf_dev))
+		return PTR_ERR(nvmevf_dev);
 
-	dev_set_drvdata(&pdev->dev, vdev);
-	ret = vfio_pci_core_register_device(vdev);
+	dev_set_drvdata(&pdev->dev, &nvmevf_dev->core_device);
+	ret = vfio_pci_core_register_device(&nvmevf_dev->core_device);
 	if (ret)
 		goto out_put_dev;
-
 	return 0;
 
 out_put_dev:
-	vfio_put_device(&vdev->vdev);
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
 	return ret;
+
 }
 
 static void nvmevf_pci_remove(struct pci_dev *pdev)
 {
-	struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	vfio_pci_core_unregister_device(&nvmevf_dev->core_device);
+	vfio_put_device(&nvmevf_dev->core_device.vdev);
+}
+
+static void nvmevf_pci_aer_reset_done(struct pci_dev *pdev)
+{
+	struct nvmevf_pci_core_device *nvmevf_dev = nvmevf_drvdata(pdev);
+
+	if (!nvmevf_dev->migrate_cap)
+		return;
 
-	vfio_pci_core_unregister_device(vdev);
-	vfio_put_device(&vdev->vdev);
+	/*
+	 * As the higher VFIO layers are holding locks across reset and using
+	 * those same locks with the mm_lock we need to prevent ABBA deadlock
+	 * with the state_mutex and mm_lock.
+	 * In case the state_mutex was taken already we defer the cleanup work
+	 * to the unlock flow of the other running context.
+	 */
+	spin_lock(&nvmevf_dev->reset_lock);
+	nvmevf_dev->deferred_reset = true;
+	if (!mutex_trylock(&nvmevf_dev->state_mutex)) {
+		spin_unlock(&nvmevf_dev->reset_lock);
+		return;
+	}
+	spin_unlock(&nvmevf_dev->reset_lock);
+	nvmevf_state_mutex_unlock(nvmevf_dev);
 }
 
 static const struct pci_device_id nvmevf_pci_table[] = {
@@ -83,12 +581,17 @@ static const struct pci_device_id nvmevf_pci_table[] = {
 
 MODULE_DEVICE_TABLE(pci, nvmevf_pci_table);
 
+static const struct pci_error_handlers nvmevf_err_handlers = {
+	.reset_done = nvmevf_pci_aer_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver nvmevf_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = nvmevf_pci_table,
 	.probe = nvmevf_pci_probe,
 	.remove = nvmevf_pci_remove,
-	.err_handler = &vfio_pci_core_err_handlers,
+	.err_handler = &nvmevf_err_handlers,
 	.driver_managed_dma = true,
 };
 
@@ -96,4 +599,4 @@ module_pci_driver(nvmevf_pci_driver);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Lei Rao <lei.rao@intel.com>");
-MODULE_DESCRIPTION("NVMe VFIO PCI - Generic VFIO PCI driver for NVMe");
+MODULE_DESCRIPTION("NVMe VFIO PCI - VFIO PCI driver with live migration support for NVMe");
diff --git a/drivers/vfio/pci/nvme/nvme.h b/drivers/vfio/pci/nvme/nvme.h
new file mode 100644
index 000000000000..c8464554ef53
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2022, INTEL CORPORATION. All rights reserved
All rights reserved + */ + +#ifndef NVME_VFIO_PCI_H +#define NVME_VFIO_PCI_H + +#include +#include +#include + +struct nvme_live_mig_query_size { + __u8 opcode; + __u8 flags; + __u16 command_id; + __u32 rsvd1[9]; + __u16 vf_index; + __u16 rsvd2; + __u32 rsvd3[5]; +}; + +struct nvme_live_mig_suspend { + __u8 opcode; + __u8 flags; + __u16 command_id; + __u32 rsvd1[9]; + __u16 vf_index; + __u16 rsvd2; + __u32 rsvd3[5]; +}; + +struct nvme_live_mig_resume { + __u8 opcode; + __u8 flags; + __u16 command_id; + __u32 rsvd1[9]; + __u16 vf_index; + __u16 rsvd2; + __u32 rsvd3[5]; +}; + +struct nvme_live_mig_save_data { + __u8 opcode; + __u8 flags; + __u16 command_id; + __u32 rsvd1[5]; + __le64 prp1; + __le64 prp2; + __u16 vf_index; + __u16 rsvd2; + __u32 rsvd3[5]; +}; + +struct nvme_live_mig_load_data { + __u8 opcode; + __u8 flags; + __u16 command_id; + __u32 rsvd1[5]; + __le64 prp1; + __le64 prp2; + __u16 vf_index; + __u16 rsvd2; + __u32 size; + __u32 rsvd3[4]; +}; + +enum nvme_live_mig_admin_opcode { + nvme_admin_live_mig_query_data_size =3D 0xC4, + nvme_admin_live_mig_suspend =3D 0xC8, + nvme_admin_live_mig_resume =3D 0xCC, + nvme_admin_live_mig_save_data =3D 0xD2, + nvme_admin_live_mig_load_data =3D 0xD5, +}; + +struct nvme_live_mig_command { + union { + struct nvme_live_mig_query_size query; + struct nvme_live_mig_suspend suspend; + struct nvme_live_mig_resume resume; + struct nvme_live_mig_save_data save; + struct nvme_live_mig_load_data load; + }; +}; + +struct nvmevf_migration_file { + struct file *filp; + struct mutex lock; + bool disabled; + u8 *vf_data; + size_t total_length; +}; + +struct nvmevf_pci_core_device { + struct vfio_pci_core_device core_device; + int vf_id; + u8 migrate_cap:1; + u8 deferred_reset:1; + /* protect migration state */ + struct mutex state_mutex; + enum vfio_device_mig_state mig_state; + /* protect the reset_done flow */ + spinlock_t reset_lock; + struct nvmevf_migration_file *resuming_migf; + struct nvmevf_migration_file *saving_migf; +}; + +extern int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cm= d, + size_t *result, void *buffer, unsigned int bufflen); + +#endif /* NVME_VFIO_PCI_H */ --=20 2.34.1 From nobody Thu Sep 18 12:57:19 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B95B4C352A1 for ; Tue, 6 Dec 2022 05:59:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234003AbiLFF7d (ORCPT ); Tue, 6 Dec 2022 00:59:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43742 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233637AbiLFF7H (ORCPT ); Tue, 6 Dec 2022 00:59:07 -0500 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2BE7927140; Mon, 5 Dec 2022 21:59:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670306343; x=1701842343; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=9CnQAmoATTPKUFpttCxBP0OKKTyZY2W0SyyLmyVLCtc=; b=m2gUfz2/TTX/k3a63KziwsFuG894D2E3LRLNIX28gQhFis6EtJAZi1aM 6aO1ymb0Yq8FX7w6cNLvGRjeaqLX/vfhRcloON+ESzVxOLisVMypHlXKO sTL0ZTHh9xjbaM5/iW7oLgveH9fRtRrUdpwKUhEsg6C4oWj/Rqn8r18pb D5bM13ypNjE7p4y9olRN6U5WxjD6wXVAlOVSYlMwwaD4I+mWK3B+SF6Ic 
From: Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 4/5] nvme-vfio: check if the hardware supports live migration
Date: Tue, 6 Dec 2022 13:58:15 +0800
Message-Id: <20221206055816.292304-5-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

The NVMe device can extend the vendor-specific field in the identify
controller data structure to indicate whether live migration is
supported. Check this field and only enable migration support when the
hardware reports it.
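The check below reads id->vs[0], which is byte 3072 of the identify data
(the "Live Migration Support" byte documented in patch 5/5). A kernel-side,
compile-time sketch of that correspondence, assuming the layout of struct
nvme_id_ctrl in include/linux/nvme.h:

	#include <linux/build_bug.h>
	#include <linux/nvme.h>
	#include <linux/stddef.h>

	/* The 4096-byte identify data ends with the vendor-specific bytes,
	 * so vs[0] is the live migration flag from patch 5/5. */
	static_assert(sizeof(struct nvme_id_ctrl) == 4096);
	static_assert(offsetof(struct nvme_id_ctrl, vs) == 3072);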
Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/nvme.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/vfio/pci/nvme/nvme.c b/drivers/vfio/pci/nvme/nvme.c
index 698e470a4e53..2ffc90ad556d 100644
--- a/drivers/vfio/pci/nvme/nvme.c
+++ b/drivers/vfio/pci/nvme/nvme.c
@@ -473,6 +473,36 @@ static void nvmevf_pci_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
+static bool nvmevf_check_migration(struct pci_dev *pdev)
+{
+	struct nvme_command c = { };
+	struct nvme_id_ctrl *id;
+	u8 live_mig_support;
+	int ret;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_CTRL;
+
+	id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
+	if (!id)
+		return false;
+
+	ret = nvme_submit_vf_cmd(pdev, &c, NULL, id, sizeof(struct nvme_id_ctrl));
+	if (ret) {
+		dev_warn(&pdev->dev, "Get identify ctrl failed (ret=0x%x)\n", ret);
+		goto out;
+	}
+
+	live_mig_support = id->vs[0];
+	if (live_mig_support) {
+		kfree(id);
+		return true;
+	}
+out:
+	kfree(id);
+	return false;
+}
+
 static const struct vfio_migration_ops nvmevf_pci_mig_ops = {
 	.migration_set_state = nvmevf_pci_set_device_state,
 	.migration_get_state = nvmevf_pci_get_device_state,
@@ -489,6 +519,10 @@ static int nvmevf_migration_init_dev(struct vfio_device *core_vdev)
 	if (!pdev->is_virtfn)
 		return ret;
 
+	/* Check the identify controller data structure for live migration support */
+	if (!nvmevf_check_migration(pdev))
+		return ret;
+
 	nvmevf_dev->migrate_cap = 1;
 
 	vf_id = pci_iov_vf_id(pdev);
--
2.34.1
From: Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
Date: Tue, 6 Dec 2022 13:58:16 +0800
Message-Id: <20221206055816.292304-6-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

The documentation describes the details of the NVMe hardware extension
that supports VFIO live migration.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/nvme.txt

diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, the NVMe device defines its own implementation,
+including five new vendor-specific admin commands and a capability flag in
+the vendor-specific field of the identify controller data structure.
+Software uses the live migration admin commands to query the size of the
+device's migration state, to save and load that state, and to suspend and
+resume a given VF device. The commands are submitted to the NVMe PF
+device's admin queue and are ignored if placed in a VF device's admin
+queue: the VF is passed through to a virtual machine, so its admin queue
+is owned by the guest NVMe driver and is not available to the hypervisor
+for submitting live migration commands. The capability flag in the
+identify controller data structure lets software detect whether the NVMe
+device supports live migration. The following sections describe the
+detailed format of the commands and the capability flag.
+
+Definition of opcodes for live migration commands
+-------------------------------------------------
+
+All five commands use vendor-specific opcodes, encoded as follows:
+
++---------+----------+----------+----------+------------+----------------------+
+| 07      | 06:02    | 01:00    | Combined | Namespace  | Command              |
+| Generic | Function | Data     | Opcode   | Identifier |                      |
+| Command |          | Transfer |          | used       |                      |
++---------+----------+----------+----------+------------+----------------------+
+| 1b      | 10001b   | 00b      | 0xC4     |            | Query the data size  |
++---------+----------+----------+----------+------------+----------------------+
+| 1b      | 10010b   | 00b      | 0xC8     |            | Suspend the VF       |
++---------+----------+----------+----------+------------+----------------------+
+| 1b      | 10011b   | 00b      | 0xCC     |            | Resume the VF        |
++---------+----------+----------+----------+------------+----------------------+
+| 1b      | 10100b   | 10b      | 0xD2     |            | Save the device data |
++---------+----------+----------+----------+------------+----------------------+
+| 1b      | 10101b   | 01b      | 0xD5     |            | Load the device data |
++---------+----------+----------+----------+------------+----------------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------+
+| Bytes   | Description                                                      |
++---------+------------------------------------------------------------------+
+| 03:00   | Command Dword 0:                                                 |
+|         |   Bits 07:00 - Opcode (OPC): set to 0xC4 for a query command     |
+|         |   Bits 09:08 - Fused Operation (FUSE): see the NVMe spec [1]     |
+|         |   Bits 13:10 - Reserved                                          |
+|         |   Bits 15:14 - PRP or SGL for Data Transfer (PSDT): see [1]      |
+|         |   Bits 31:16 - Command Identifier (CID)                          |
++---------+------------------------------------------------------------------+
+| 39:04   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+| 41:40   | VF index: which VF controller's internal data size to query      |
++---------+------------------------------------------------------------------+
+| 63:42   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the size of the NVMe VF's
+internal data for live migration. When the NVMe firmware receives the
+command, it returns the size of the NVMe VF internal data. The data size
+depends on how many IO queues have been created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------+
+| Bytes   | Description                                                      |
++---------+------------------------------------------------------------------+
+| 03:00   | Command Dword 0:                                                 |
+|         |   Bits 07:00 - Opcode (OPC): set to 0xC8 for a suspend command   |
+|         |   Bits 09:08 - Fused Operation (FUSE): see the NVMe spec [1]     |
+|         |   Bits 13:10 - Reserved                                          |
+|         |   Bits 15:14 - PRP or SGL for Data Transfer (PSDT): see [1]      |
+|         |   Bits 31:16 - Command Identifier (CID)                          |
++---------+------------------------------------------------------------------+
+| 39:04   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+| 41:40   | VF index: which VF controller to suspend                         |
++---------+------------------------------------------------------------------+
+| 63:42   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the
+NVMe firmware receives this command, it suspends the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------+
+| Bytes   | Description                                                      |
++---------+------------------------------------------------------------------+
+| 03:00   | Command Dword 0:                                                 |
+|         |   Bits 07:00 - Opcode (OPC): set to 0xCC for a resume command    |
+|         |   Bits 09:08 - Fused Operation (FUSE): see the NVMe spec [1]     |
+|         |   Bits 13:10 - Reserved                                          |
+|         |   Bits 15:14 - PRP or SGL for Data Transfer (PSDT): see [1]      |
+|         |   Bits 31:16 - Command Identifier (CID)                          |
++---------+------------------------------------------------------------------+
+| 39:04   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+| 41:40   | VF index: which VF controller to resume                          |
++---------+------------------------------------------------------------------+
+| 63:42   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When the
+firmware receives this command, it restarts the NVMe VF controller.
+
+Definition of SAVE_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------+
+| Bytes   | Description                                                      |
++---------+------------------------------------------------------------------+
+| 03:00   | Command Dword 0:                                                 |
+|         |   Bits 07:00 - Opcode (OPC): set to 0xD2 for a save command      |
+|         |   Bits 09:08 - Fused Operation (FUSE): see the NVMe spec [1]     |
+|         |   Bits 13:10 - Reserved                                          |
+|         |   Bits 15:14 - PRP or SGL for Data Transfer (PSDT): see [1]      |
+|         |   Bits 31:16 - Command Identifier (CID)                          |
++---------+------------------------------------------------------------------+
+| 23:04   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+| 31:24   | PRP Entry 1: the first PRP entry for the command, or a PRP List  |
+|         | pointer                                                          |
++---------+------------------------------------------------------------------+
+| 39:32   | PRP Entry 2: the second address entry (reserved, page base       |
+|         | address, or PRP List pointer)                                    |
++---------+------------------------------------------------------------------+
+| 41:40   | VF index: which VF controller's internal data to save            |
++---------+------------------------------------------------------------------+
+| 63:42   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF's internal data for live
+migration. When the firmware receives this command, it saves the admin
+queue state and some registers, drains the IO SQs and CQs, saves the state
+of every IO queue, disables the VF controller, and transfers all of the
+data to host memory through DMA.
+
+Definition of LOAD_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------+
+| Bytes   | Description                                                      |
++---------+------------------------------------------------------------------+
+| 03:00   | Command Dword 0:                                                 |
+|         |   Bits 07:00 - Opcode (OPC): set to 0xD5 for a load command      |
+|         |   Bits 09:08 - Fused Operation (FUSE): see the NVMe spec [1]     |
+|         |   Bits 13:10 - Reserved                                          |
+|         |   Bits 15:14 - PRP or SGL for Data Transfer (PSDT): see [1]      |
+|         |   Bits 31:16 - Command Identifier (CID)                          |
++---------+------------------------------------------------------------------+
+| 23:04   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+| 31:24   | PRP Entry 1: the first PRP entry for the command, or a PRP List  |
+|         | pointer                                                          |
++---------+------------------------------------------------------------------+
+| 39:32   | PRP Entry 2: the second address entry (reserved, page base       |
+|         | address, or PRP List pointer)                                    |
++---------+------------------------------------------------------------------+
+| 41:40   | VF index: which VF controller's internal data to load            |
++---------+------------------------------------------------------------------+
+| 47:44   | Size: the size of the device's internal data to be loaded        |
++---------+------------------------------------------------------------------+
+| 63:48   | Reserved                                                         |
++---------+------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF's internal data. When
+the firmware receives this command, it reads the device's internal data
+from host memory through DMA, restores the admin queue state and some
+registers, and restores the state of every IO queue.
+
+Extensions of the vendor-specific field in the identify controller data structure
+----------------------------------------------------------------------------------
+
++-----------+-----+-------+------+---------------------------------+
+| Bytes     | I/O | Admin | Disc | Description                     |
++-----------+-----+-------+------+---------------------------------+
+| 01:00     | M   | M     | R    | PCI Vendor ID (VID)             |
++-----------+-----+-------+------+---------------------------------+
+| 03:02     | M   | M     | R    | PCI Subsystem Vendor ID (SSVID) |
++-----------+-----+-------+------+---------------------------------+
+| ...       | ... | ...   | ...  | ...                             |
++-----------+-----+-------+------+---------------------------------+
+| 3072      | O   | O     | O    | Live Migration Support          |
++-----------+-----+-------+------+---------------------------------+
+| 4095:3073 | O   | O     | O    | Vendor Specific                 |
++-----------+-----+-------+------+---------------------------------+
+
+According to the NVMe specification, bytes 3072 to 4095 of the identify
+controller data structure are vendor-specific. The NVMe device uses byte
+3072 to indicate whether live migration is supported: 0x0 means live
+migration is not supported, 0x1 means it is supported, and all other
+values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
--
2.34.1
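As a cross-check, the byte layout in the tables above matches the command
structures added in patch 3/5 (drivers/vfio/pci/nvme/nvme.h). A
compile-time sketch of that correspondence:

	#include <linux/build_bug.h>
	#include <linux/stddef.h>
	#include "nvme.h"	/* the header added in patch 3/5 */

	/* SAVE_DATA: PRP Entry 1 at bytes 31:24, PRP Entry 2 at 39:32,
	 * VF index at 41:40, in a standard 64-byte submission queue entry. */
	static_assert(sizeof(struct nvme_live_mig_save_data) == 64);
	static_assert(offsetof(struct nvme_live_mig_save_data, prp1) == 24);
	static_assert(offsetof(struct nvme_live_mig_save_data, prp2) == 32);
	static_assert(offsetof(struct nvme_live_mig_save_data, vf_index) == 40);

	/* LOAD_DATA additionally carries the state size at bytes 47:44. */
	static_assert(offsetof(struct nvme_live_mig_load_data, size) == 44);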