From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7BE5CDF6C; Tue, 9 Sep 2025 22:00:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455213; cv=none; b=M0vd6dSEy4BZKix9829TaMOqG1ZxghX9qXgBAAiDvDC6oe370qTkjzVRqOfzaBGU1WB5Uo7HTyNtSFcN69+0NvmbtHETM7ufDxvyZPojWsSoyxaekTJ9UGCgkdBz51+iHNVFVPD5/SCgf/nI1zZrAOcnn0gbc/VMvhCYrYJVmUs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455213; c=relaxed/simple; bh=+cYC7KKFbfc+PsZdj7Hu4e4V8Q72FyxbrdzCuw3DHgM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=SM50U/be1WfJq/hap/1tWZiqg3+UT+GZr6Of96K5Vd/6L0sQ3FgHZ8UhiWmC49ktMB3+7f/bzJ6z0kU+ehVrt1VYQmmuKQiNds76mK4pvfgiOF0QywsL3rkX56TkL2E5tMfk0ZAgV5DF5NyDmonhdcrer1B/epYp2lMH/YJsDTg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=TauRSjvm; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="TauRSjvm" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455211; x=1788991211; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=+cYC7KKFbfc+PsZdj7Hu4e4V8Q72FyxbrdzCuw3DHgM=; b=TauRSjvmZcnXcTDBphAUWvAlW+x56BOnxULB59bYa8LpMMC303dVFEqJ La7mWxZzFvMO3tHkGdvF+iUTMQCBY1tOtqNuzzjcsc71JMPHCfvX+k88U WtoaSJ4Sd/ZPsJQO/LlcbuOtuTy7D1/f8dkq4XqZLfBfbCt941cMSOwKv 3bp3IrzEvH4bL1RaXz2dwhQc4Qih12baRWOKYrtwCZT8a5KD667mnz17E inLZVOzhodR158hPKR5VPIGbKokNbxMwMlAy+uVJoOOJ9SqCtrlq4hAd8 pu/riEAxr5n03UBQXgc/wCw81sNpbFhKVfnn/2FJ0tz0cNB0Q9SruQync Q==; X-CSE-ConnectionGUID: c6mNF5eyQxuQAP45cTZE9w== X-CSE-MsgGUID: cjaxYB8nRISwUdbR+4m9AA== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584620" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584620" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:07 -0700 X-CSE-ConnectionGUID: xzLJntAGSSqXvhKR6ZBw/w== X-CSE-MsgGUID: vOu/IhS/RM+HPCdVOrJRhA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780951" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:07 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:50 -0700 Subject: [PATCH RFC net-next 1/7] ice: add basic skeleton and TLV framework for live migration support Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-1-4d1dc641e31f@intel.com> References: 
<20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=33763; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=+cYC7KKFbfc+PsZdj7Hu4e4V8Q72FyxbrdzCuw3DHgM=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi1PUbNXYWcqeG7eYT3t6Zc2hl2bOIrcmli/+Nsnwh 9X1VgO+jlIWBjEuBlkxRRYFh5CV140nhGm9cZaDmcPKBDKEgYtTACbyW4vhv3f/ntqtvYfCxM2O 1fzLCpubcPmz9utjdRUis6erHPt/6xvDP+viA1UbpjbKWIW+lilJ6znH+yZCLCpw5V3pSM1Puk/ /8gMA X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 In preparation for supporting VF live migration, introduce a new functional block to the ice driver to expose hooks for migrating a VF. The ice-vfio-pci driver will call ice_migration_init_dev() to indicate that the migration driver is loaded. When the host wants to migrate the device, first it suspends the device by calling ice_migration_suspend_dev(). Then it saves any device state by calling ice_migration_save_devstate() which serializes the VF device state into a buffer payload which is sent to the resuming host. This target system calls ice_migration_suspend_dev() to stop the target VF device, and finally calls ice_migration_load_devstate() to deserialize the migration state and program the target VF. Add the virt/migration.c file and implement the skeleton for these functions. A basic framework is implemented, but the following patches will fully flesh out the serialization and deserialization for the various kinds of VF data. Previous implementations of live migration were implemented by serializing device state as a series of virtchnl messages. It was thought that this would be simple since virtchnl is the ABI for communication between the PF and VF. Unfortunately, use of virtchnl presented numerous problems. To avoid these issues, the migration buffer will be implemented using a custom Type-Length-Value format which avoids a few key issues with the virtchnl solution: 1. Migration data is only captured when a live migration event is requested. The driver no longer caches a copy of every virtchnl message sent by the VF. 2. Migration data size is bounded, and the VF cannot indirectly increase it to arbitrary sizes by sending unexpected sequences of virtchnl messages. 3. Replay of VF state is controlled precisely, and no longer requires modification of the standard virtchnl communication flow. 4. Additional data about the VF state is sent over the migration buffer, which is not captured by virtchnl messages. This includes host state such as the VF trust flag, MSI-X configuration, RSS configuration, etc. Introduce the initial support for this TLV format along with the first TLV for storing basic VF info. The TLV definitions are placed in virt/migration_tlv.h which defines the payload data and describes the ABI for this communication format. The first TLV must be the ICE_MIG_TLV_HEADER which consists of a magic number and a version used to identify the payload format. 
This allows for the possibility of future extension should the entire format need to be changed.

TLVs are specified as a series of ice_migration_tlv structures, which are variable length structures using a flexible array of bytes. These structures are __packed. However, access to unaligned memory requires the compiler to insert additional instructions on some platforms. To minimize -- but not completely remove -- this cost, the length of every TLV is required to be aligned to 4 bytes. The TLV header itself is 4 bytes, which ensures that the header and all values of 4 bytes or smaller have aligned access. 8-byte values may still be unaligned, but use of the __packed attribute ensures that the compiler inserts the appropriate access instructions on all platforms.

Note that this migration implementation generally assumes the host and target system are of the same endianness, integer size, etc. The chosen magic number of 0xE8000001 will catch byte order differences, and all types in the TLVs are fixed size.

The type of each TLV is specified by the ice_migration_tlvs enumeration. This *is* ABI, and any extension must add new TLVs at the end of the list, just prior to NUM_ICE_MIG_TLV, which specifies the number of TLVs recognized by this version of the ABI. The exact "version" of the ABI is specified by the combination of the magic number, the version in the header, and the number of TLVs as given by NUM_ICE_MIG_TLV, which is sent in the header. This allows new TLVs to be added in the future without incrementing the version number. Such an increment should only be done if an existing TLV format must change, or if the entire format must change for some reason.

The payload *must* begin with the ICE_MIG_TLV_HEADER, as this specifies the format. All other TLVs have no defined order, and receiving code must be capable of handling TLVs in an arbitrary order. Some TLVs may be sent multiple times, such as the Tx and Rx queue information. The special ICE_MIG_TLV_END type must come last and is a marker indicating the end of the payload buffer. It has a size of 0.

Implement a few macros to help reduce boilerplate. The ice_mig_tlv_type() macro converts a structure pointer to its TLV type using _Generic(). The ice_mig_alloc_tlv() macro allocates a new TLV, returning a pointer to the value structure; it ensures the 4-byte alignment required for all TLV lengths. The ice_mig_alloc_flex_tlv() macro is similar, but allows allocating a TLV whose element structure ends in a flexible array. Finally, the ice_mig_tlv_add_tail() macro takes a pointer to an element structure, finds the containing ice_mig_tlv_entry, and inserts it into the TLV list used for temporary storage. 
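For illustration only, the following is a minimal standalone sketch of how a receiver could walk such a TLV stream. It mirrors the format described above (4-byte aligned lengths, header first, END marker last) but uses stand-in type and constant names; it is not part of the driver:

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define MIG_MAGIC   0xE8000001u
  #define MIG_VERSION 1

  enum { TLV_END = 0, TLV_HEADER = 1 };

  struct tlv {
      uint16_t type;
      uint16_t len;           /* always a multiple of 4 */
      uint8_t data[];
  } __attribute__((packed));

  struct tlv_header {
      uint32_t magic;
      uint16_t version;
      uint16_t num_supported_tlvs;
  } __attribute__((packed));

  static int walk(const void *buf, size_t sz)
  {
      const struct tlv *tlv = buf;
      struct tlv_header hdr;

      for (;;) {
          size_t step;

          if (sz < sizeof(*tlv))
              return -1;      /* truncated TLV header */
          if (tlv->len & 3)
              return -1;      /* length not 4-byte aligned */
          step = sizeof(*tlv) + tlv->len;
          if (sz < step)
              return -1;      /* truncated TLV value */

          if (tlv == buf) {
              /* The first TLV must be the format header */
              if (tlv->type != TLV_HEADER || tlv->len < sizeof(hdr))
                  return -1;
              memcpy(&hdr, tlv->data, sizeof(hdr));
              if (hdr.magic != MIG_MAGIC || hdr.version != MIG_VERSION)
                  return -1;
          } else if (tlv->type == TLV_END) {
              return 0;       /* end marker, done */
          } else {
              printf("TLV type %u, len %u\n", tlv->type, tlv->len);
          }

          sz -= step;
          tlv = (const struct tlv *)((const uint8_t *)tlv + step);
      }
  }

The in-kernel validation in ice_migration_validate_tlvs() follows the same shape, but with the additional checks described below.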
Signed-off-by: Jacob Keller --- drivers/net/ethernet/intel/ice/ice.h | 2 + drivers/net/ethernet/intel/ice/ice_vf_lib.h | 2 + .../net/ethernet/intel/ice/virt/migration_tlv.h | 221 +++++++++ include/linux/net/intel/ice_migration.h | 49 ++ drivers/net/ethernet/intel/ice/ice_main.c | 16 + drivers/net/ethernet/intel/ice/ice_vf_lib.c | 3 + drivers/net/ethernet/intel/ice/virt/migration.c | 506 +++++++++++++++++= ++++ drivers/net/ethernet/intel/ice/Makefile | 1 + 8 files changed, 800 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/in= tel/ice/ice.h index e952d67388bf..daa395f5691f 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -55,6 +55,7 @@ #include #include #include +#include #include "ice_devids.h" #include "ice_type.h" #include "ice_txrx.h" @@ -1015,6 +1016,7 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __= be16 proto, u16 vid); int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid); void ice_get_stats64(struct net_device *netdev, struct rtnl_link_stats64 *stats); +struct ice_pf *ice_vf_dev_to_pf(struct pci_dev *vf_dev); =20 /** * ice_set_rdma_cap - enable RDMA support diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.h b/drivers/net/ethe= rnet/intel/ice/ice_vf_lib.h index b00708907176..1a0c66182a9a 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.h +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.h @@ -124,6 +124,7 @@ struct ice_vf { u8 link_forced:1; u8 link_up:1; /* only valid if VF link is forced */ u8 lldp_tx_ena:1; + u8 migration_enabled:1; =20 u16 num_msix; /* num of MSI-X configured on this VF */ =20 @@ -158,6 +159,7 @@ struct ice_vf { u16 lldp_rule_id; =20 struct ice_vf_qs_bw qs_bw[ICE_MAX_RSS_QS_PER_VF]; + struct list_head mig_tlvs; }; =20 /* Flags for controlling behavior of ice_reset_vf */ diff --git a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h b/drivers/= net/ethernet/intel/ice/virt/migration_tlv.h new file mode 100644 index 000000000000..2c5b4578060b --- /dev/null +++ b/drivers/net/ethernet/intel/ice/virt/migration_tlv.h @@ -0,0 +1,221 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2025, Intel Corporation. */ + +#ifndef _VIRT_MIGRATION_TLV_H_ +#define _VIRT_MIGRATION_TLV_H_ + +#include + +/* The ice driver uses a series of TLVs to define the live migration data = that + * is passed between PFs during a migration event. This data includes all = of + * the information required to migrate the VM onto a new VF without loss of + * data. + * + * On a migration event, the initial PF will scan its VF structures for + * relevant information and serialize it into TLVs which are passed as par= t of + * the binary migration data. + * + * The target PF will read the binary migration data and deserialize it us= ing + * the TLV definitions. + * + * The first TLV in the binary data *MUST* be ICE_MIG_TLV_HEADER, and defi= nes + * the overall migration version and format. + * + * A receiving PF should scan the set of provided TLVs, and ensure that it + * recognizes all of the provided data. Once validated, the PF can apply t= he + * configuration to the target VF, ensuring it is configured appropriately= to + * match the VM. 
+ */ + +#define ICE_MIG_MAGIC 0xE8000001 +#define ICE_MIG_VERSION 1 + +#define ICE_MIG_VF_ITR_NUM 4 + +/** + * struct ice_migration_tlv - TLV header structure + * @type: type identifier for the data + * @len: length of the data block + * @data: migration data payload + * + * Migration data is serialized using this structure as a series of + * type-length-value chunks. Each TLV is defined by its type. The length c= an + * be used to move to the next TLV in the full data payload. + * + * The data payload structure is defined by the structure associated with = the + * type as defined by the following enumerations and structures. + * + * TLVs are placed within the binary migration payload sequentially, and a= re + * __packed in order to avoid padding. + * + * Some of the TLVs are variable length, which could result in excessive + * unaligned accesses. While the compiler should insert appropriate + * instructions to handle this access due to the __packed attribute, we + * enforce that all TLV headers begin at a 4-byte aligned boundary by padd= ing + * all TLV sizes to multiple of 4-bytes. This minimizes the amount of + * unaligned access without sacrificing significant additional space. + */ +struct ice_migration_tlv { + u16 type; + u16 len; + u8 data[] __counted_by(len); +} __packed; + +/** + * enum ice_migration_tlvs - Valid TLV types + * + * @ICE_MIG_TLV_END: Used to mark the end of the TLV list. The TLV header = will + * have a len of 0 and no data. + * + * @ICE_MIG_TLV_HEADER: Header identifying the migration format. Must be t= he + * first TLV in the list. + * + * @NUM_ICE_MIG_TLV: Number of known TLV types. Any type equal to or larger + * than this value is unrecognized by this version. + * + * Enumeration of valid types for the virtualization migration data. The T= LV + * data is transferred between PFs, so this must be treated as ABI that ca= n't + * change. + */ +enum ice_migration_tlvs { + /* Do not change the order or add anything between, this is ABI! */ + ICE_MIG_TLV_END =3D 0, + ICE_MIG_TLV_HEADER, + + /* Add new types above here */ + NUM_ICE_MIG_TLV +}; + +/** + * struct ice_mig_tlv_entry - Wrapper to store TLV entries in linked list + * @list_entry: list node used for temporary storage prior to STOP_COPY + * @tlv: The migration TLV data. + * + * Because ice_migration_tlv is a variable length structure, this is also + * a variable length structure. + */ +struct ice_mig_tlv_entry { + struct list_head list_entry; + struct ice_migration_tlv tlv; +}; + +/** + * struct ice_mig_tlv_header - Migration version header + * @magic: Magic number identifying this migration format. Always 0xE80000= 01. + * @version: Version of the migration format. + * @num_supported_tlvs: The value of NUM_ICE_MIG_TLV for the sender. + * + * Structure defining the version of the migration data payload. A magic + * number and version are used to identify this format. This is to potenti= ally + * allow changing or extending the format in the future in a way that the + * receiving system can recognize. + * + * The num_supported_tlvs field is used to inform the receiver of the + * supported set of TLVs being sent with this payload. This allows the + * receiver to quickly identify if the payload may contain data it does not + * recognize. 
+ */ +struct ice_mig_tlv_header { + u32 magic; + u16 version; + u16 num_supported_tlvs; +} __packed; + +/** + * ice_mig_tlv_type - Convert a TLV type to its number + * @p: the TLV structure type + * + * Generic which converts the specified TLV structure type to its TLV nume= ric + * value. Used to reduce potential error when initializing a TLV header for + * the migration payload. + */ +#define ice_mig_tlv_type(p) \ + _Generic(*(p), \ + struct ice_mig_tlv_header : ICE_MIG_TLV_HEADER, \ + default : ICE_MIG_TLV_END) + +/** + * ice_mig_alloc_tlv - Allocate a non-variable length TLV entry + * @p: pointer to the TLV element type + * + * Shorthand macro which allocates space for both a TLV header and the TLV + * element structure. For variable-length TLVs with a flexible array membe= r, + * use ice_mig_alloc_flex_tlv instead. + * + * Because the allocations are ultimately triggered from userspace, and mu= st + * be held until userspace actually initiates the migration, allocate with + * GFP_KERLEL_ACCOUNT, causing the allocations to be accounted by kmemcg. + * + * Returns: pointer to the allocated TLV element, or NULL on failure to + * allocate. + */ +#define ice_mig_alloc_tlv(p) \ + ({ \ + struct ice_mig_tlv_entry *entry; \ + typeof(p) __elem; \ + size_t tlv_size; \ + \ + tlv_size =3D ALIGN(sizeof(*__elem), 4); \ + entry =3D kzalloc(struct_size(entry, tlv.data, tlv_size), \ + GFP_KERNEL_ACCOUNT); \ + if (!entry) { \ + __elem =3D NULL; \ + } else { \ + entry->tlv.type =3D ice_mig_tlv_type(__elem); \ + entry->tlv.len =3D tlv_size; \ + __elem =3D (typeof(__elem))entry->tlv.data; \ + } \ + __elem; \ + }) + +/** + * ice_mig_alloc_flex_tlv - Allocate a variable length TLV with flexible a= rray + * @p: pointer to the TLV element type + * @member: flexible array member element + * @count: number of elements in the flexible array. + * + * Shorthand macro which allocates space for both a TLV header and the TLV + * element structure, and its variable length flexible array member. + * + * Because the allocations are ultimately triggered from userspace, and mu= st + * be held until userspace actually initiates the migration, allocate with + * GFP_KERLEL_ACCOUNT, causing the allocations to be accounted by kmemcg. + * + * Returns: pointer to the allocated TLV element, or NULL on failure to + * allocate. + */ +#define ice_mig_alloc_flex_tlv(p, member, count) \ + ({ \ + struct ice_mig_tlv_entry *entry; \ + typeof(p) __elem; \ + size_t tlv_size; \ + \ + tlv_size =3D ALIGN(struct_size(__elem, member, count), 4);\ + entry =3D kzalloc(struct_size(entry, tlv.data, tlv_size), \ + GFP_KERNEL_ACCOUNT); \ + if (!entry) { \ + __elem =3D NULL; \ + } else { \ + entry->tlv.type =3D ice_mig_tlv_type(__elem); \ + entry->tlv.len =3D tlv_size; \ + __elem =3D (typeof(__elem))entry->tlv.data; \ + } \ + __elem; \ + }) + +/** + * ice_mig_tlv_add_tail - Add TLV element to tail of a TLV list + * @p: pointer to the TLV element + * @head: pointer to the head of the linked list to insert into + * + * Shorthand macro to find the struct ice_mig_tlv_entry header pointer of = the + * given TLV element and insert it into the list. 
+ */ +#define ice_mig_tlv_add_tail(p, head) \ + ({ \ + struct ice_mig_tlv_entry *entry; \ + entry =3D container_of((void *)p, typeof(*entry), tlv.data); \ + list_add_tail(&entry->list_entry, head); \ + }) + +#endif /* _VIRT_MIGRATION_TLV_H_ */ diff --git a/include/linux/net/intel/ice_migration.h b/include/linux/net/in= tel/ice_migration.h new file mode 100644 index 000000000000..85095f4c02c4 --- /dev/null +++ b/include/linux/net/intel/ice_migration.h @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* Copyright (C) 2018-2025 Intel Corporation */ + +#ifndef _ICE_MIGRATION_H_ +#define _ICE_MIGRATION_H_ + +#if IS_ENABLED(CONFIG_ICE_VFIO_PCI) +int ice_migration_init_dev(struct pci_dev *vf_dev); +void ice_migration_uninit_dev(struct pci_dev *vf_dev); +size_t ice_migration_get_required_size(struct pci_dev *vf_dev); +int ice_migration_save_devstate(struct pci_dev *vf_dev, void *buf, + size_t buf_sz); +int ice_migration_load_devstate(struct pci_dev *vf_dev, + const void *buf, size_t buf_sz); +int ice_migration_suspend_dev(struct pci_dev *vf_dev, bool save_state); +#else +static inline int ice_migration_init_dev(struct pci_dev *vf_dev) +{ + return -EOPNOTSUPP; +} + +static inline void ice_migration_uninit_dev(struct pci_dev *vf_dev) { } + +static inline size_t ice_migration_get_required_size(struct pci_dev *vf_de= v) +{ + return 0; +} + +static inline int +ice_migration_save_devstate(struct pci_dev *vf_dev, void *buf, + size_t buf_sz) +{ + return -EOPNOTSUPP; +} + +static inline int ice_migration_load_devstate(struct pci_dev *vf_dev, + const void *buf, size_t buf_sz) +{ + return -EOPNOTSUPP; +} + +static inline int ice_migration_suspend_dev(struct pci_dev *vf_dev, + bool save_state) +{ + return -EOPNOTSUPP; +} +#endif /* CONFIG_ICE_VFIO_PCI */ + +#endif /* _ICE_MIGRATION_H_ */ diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethern= et/intel/ice/ice_main.c index 92b95d92d599..2a204bb1ad31 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -9738,3 +9738,19 @@ static const struct net_device_ops ice_netdev_ops = =3D { .ndo_hwtstamp_get =3D ice_ptp_hwtstamp_get, .ndo_hwtstamp_set =3D ice_ptp_hwtstamp_set, }; + +/** + * ice_vf_dev_to_pf - Get PF private structure from VF PCI device pointer + * @vf_dev: pointer to a VF PCI device structure + * + * Obtain the PF private data structure of the ice PF associated with the + * provided VF PCI device. + * + * Return: pointer to the ice PF private data, or a ERR_PTR on failure. 
+ */ +struct ice_pf *ice_vf_dev_to_pf(struct pci_dev *vf_dev) +{ + struct ice_pf *pf =3D pci_iov_get_pf_drvdata(vf_dev, &ice_driver); + + return pf; +} diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethe= rnet/intel/ice/ice_vf_lib.c index de9e81ccee66..6b91aa9394af 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c @@ -1024,6 +1024,9 @@ void ice_initialize_vf_entry(struct ice_vf *vf) else ice_mbx_init_vf_info(&pf->hw, &vf->mbx_info); =20 + /* List to store migration TLVs temporarily */ + INIT_LIST_HEAD(&vf->mig_tlvs); + mutex_init(&vf->cfg_lock); } =20 diff --git a/drivers/net/ethernet/intel/ice/virt/migration.c b/drivers/net/= ethernet/intel/ice/virt/migration.c new file mode 100644 index 000000000000..f13b7674dabd --- /dev/null +++ b/drivers/net/ethernet/intel/ice/virt/migration.c @@ -0,0 +1,506 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2018-2025 Intel Corporation */ + +#include "ice.h" +#include "ice_lib.h" +#include "ice_fltr.h" +#include "ice_base.h" +#include "ice_txrx_lib.h" +#include "virt/migration_tlv.h" + +/** + * ice_migration_init_dev - Enable migration support for the requested VF + * @vf_dev: pointer to the VF PCI device + * + * TODO: currently the vf->migration_enabled field is unused. It is likely + * that we will need to use it to check that features which cannot migrate= are + * disabled. + * + * Return: 0 on success, negative error code on failure. + */ +int ice_migration_init_dev(struct pci_dev *vf_dev) +{ + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + struct ice_vf *vf; + + if (IS_ERR(pf)) + return PTR_ERR(pf); + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(&vf_dev->dev, "Unable to locate VF from VF device\n"); + return -EINVAL; + } + + mutex_lock(&vf->cfg_lock); + vf->migration_enabled =3D true; + mutex_unlock(&vf->cfg_lock); + + ice_put_vf(vf); + return 0; +} +EXPORT_SYMBOL(ice_migration_init_dev); + +/** + * ice_migration_uninit_dev - Disable migration support for the requested = VF + * @vf_dev: pointer to the VF PCI device + */ +void ice_migration_uninit_dev(struct pci_dev *vf_dev) +{ + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + struct device *dev; + struct ice_vf *vf; + + if (IS_ERR(pf)) + return; + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(&vf_dev->dev, "Unable to locate VF from VF device\n"); + return; + } + + dev =3D ice_pf_to_dev(pf); + + mutex_lock(&vf->cfg_lock); + + vf->migration_enabled =3D false; + + if (!list_empty(&vf->mig_tlvs)) { + struct ice_mig_tlv_entry *entry, *tmp; + + dev_dbg(dev, "Freeing unused migration TLVs for VF %d\n", + vf->vf_id); + + list_for_each_entry_safe(entry, tmp, &vf->mig_tlvs, + list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + } + + mutex_unlock(&vf->cfg_lock); + + ice_put_vf(vf); +} +EXPORT_SYMBOL(ice_migration_uninit_dev); + +int ice_migration_suspend_dev(struct pci_dev *vf_dev, bool save_state) +{ + return -EOPNOTSUPP; +} +EXPORT_SYMBOL(ice_migration_suspend_dev); + +/** + * ice_migration_calculate_size - Calculate the size of the migration buff= er + * @vf: pointer to the VF being migrated + * + * Calculate the total size required for all the TLVs used to form the + * migration data buffer. The TLVs containing migration data are already + * recorded and saved in the vf->mig_tlvs linked list. In addition to this= , we + * need to account for the header data and the data-end marker TLV. 
+ * + * Return: the size in bytes required to store the full migration payload. + */ +static size_t ice_migration_calculate_size(struct ice_vf *vf) +{ + struct ice_mig_tlv_entry *entry; + size_t tlv_sz, total_sz; + + lockdep_assert_held(&vf->cfg_lock); + + /* The migration data begins with a header TLV describing the format */ + total_sz =3D struct_size_t(struct ice_migration_tlv, data, + sizeof(struct ice_mig_tlv_header)); + + list_for_each_entry(entry, &vf->mig_tlvs, list_entry) { + tlv_sz =3D struct_size(&entry->tlv, data, entry->tlv.len); + total_sz =3D size_add(total_sz, tlv_sz); + } + + /* The end of the data is signified by an empty TLV */ + tlv_sz =3D struct_size_t(struct ice_migration_tlv, data, 0); + total_sz =3D size_add(total_sz, tlv_sz); + + return total_sz; +} + +/** + * ice_migration_get_required_size - Request migration payload buffer size + * @vf_dev: pointer to the VF PCI device + * + * Request the size required to serialize the VF migration payload. Used to + * calculate allocation size of the migration file. + * + * Return: the size in bytes required to store the full migration payload,= or + * 0 if this VF is not ready to migrate. + */ +size_t ice_migration_get_required_size(struct pci_dev *vf_dev) +{ + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + size_t payload_size; + struct ice_vf *vf; + + if (IS_ERR(pf)) { + dev_err(&vf_dev->dev, "Unable to locate PF from VF device, err=3D%pe\n", + pf); + return 0; + } + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(&vf_dev->dev, "Unable to locate VF from VF device\n"); + return 0; + } + + mutex_lock(&vf->cfg_lock); + + if (list_empty(&vf->mig_tlvs)) { + dev_warn(&vf_dev->dev, "VF %d is not ready to migrate\n", + vf->vf_id); + payload_size =3D 0; + } else { + payload_size =3D ice_migration_calculate_size(vf); + } + + mutex_unlock(&vf->cfg_lock); + + return payload_size; +} +EXPORT_SYMBOL(ice_migration_get_required_size); + +/** + * ice_migration_insert_tlv_header - Insert TLV header into migration buff= er + * @tlv: pointer to TLV in the migration buffer + * + * Fill in the TLV header describing the migration format. + * + * Return: the full struct_size of the TLV, used to move the migration buf= fer + * pointer to the next entry. + */ +static size_t ice_migration_insert_tlv_header(struct ice_migration_tlv *tl= v) +{ + struct ice_mig_tlv_header *tlv_header; + + tlv->type =3D ice_mig_tlv_type(tlv_header); + tlv->len =3D sizeof(*tlv_header); + tlv_header =3D (typeof(tlv_header))tlv->data; + + tlv_header->magic =3D ICE_MIG_MAGIC; + tlv_header->version =3D ICE_MIG_VERSION; + tlv_header->num_supported_tlvs =3D NUM_ICE_MIG_TLV; + + return struct_size(tlv, data, tlv->len); +} + +/** + * ice_migration_insert_tlv_end - Insert TLV marking end of migration data + * @tlv: pointer to TLV in the migration buffer + * + * Fill in the TLV marking end of the migration buffer data. + */ +static void ice_migration_insert_tlv_end(struct ice_migration_tlv *tlv) +{ + tlv->type =3D ICE_MIG_TLV_END; + tlv->len =3D 0; +} + +/** + * ice_migration_save_devstate - Save device state to migration buffer + * @vf_dev: pointer to the VF PCI device + * @buf: pointer to VF msg in migration buffer + * @buf_sz: The size of the migration buffer. + * + * Serialize the saved device state to the migration buffer. It is expected + * that buf_sz is determined by calling ice_migration_get_required_size() + * ahead of time. + * + * Return: 0 for success, or a negative error code on failure. 
+ */ +int ice_migration_save_devstate(struct pci_dev *vf_dev, void *buf, + size_t buf_sz) +{ + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + struct ice_mig_tlv_entry *entry, *tmp; + struct ice_vsi *vsi; + struct device *dev; + struct ice_vf *vf; + size_t total_sz; + int err =3D 0; + + if (IS_ERR(pf)) + return PTR_ERR(pf); + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(&vf_dev->dev, "Unable to locate VF from VF device\n"); + return -EINVAL; + } + + dev =3D ice_pf_to_dev(pf); + + dev_dbg(dev, "Serializing migration device state for VF %u\n", + vf->vf_id); + + mutex_lock(&vf->cfg_lock); + + vsi =3D ice_get_vf_vsi(vf); + if (!vsi) { + dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id); + err =3D -EINVAL; + goto out_release_cfg_lock; + } + + /* Make sure we have enough space */ + total_sz =3D ice_migration_calculate_size(vf); + if (total_sz > buf_sz) { + dev_err(dev, "Insufficient buffer to store device state for VF %d. Need = %zu bytes, but have only %zu bytes.\n", + vf->vf_id, total_sz, buf_sz); + err =3D -ENOBUFS; + goto out_release_cfg_lock; + } + + dev_dbg(dev, "Saving migration data for VF %d. Total migration payload si= ze is %zu bytes\n", + vf->vf_id, total_sz); + + /* 1. Insert the TLV header describing the migration format */ + buf +=3D ice_migration_insert_tlv_header(buf); + + /* 2. Insert the TLVs prepared by suspend */ + list_for_each_entry_safe(entry, tmp, &vf->mig_tlvs, list_entry) { + size_t tlv_sz =3D struct_size(&entry->tlv, data, entry->tlv.len); + + memcpy(buf, &entry->tlv, tlv_sz); + buf +=3D tlv_sz; + + list_del(&entry->list_entry); + kfree(entry); + } + + /* 3. Insert TLV marking the end of the data */ + ice_migration_insert_tlv_end(buf); + +out_release_cfg_lock: + mutex_unlock(&vf->cfg_lock); + ice_put_vf(vf); + + return err; +} +EXPORT_SYMBOL(ice_migration_save_devstate); + +/** + * ice_migration_check_tlv_size - Validate size of next TLV in buffer + * @dev: device structure + * @tlv: pointer to the next TLV in migration buffer + * @sz_remaining: number of bytes left in migration buffer + * + * Check that the migration buffer has sufficient space to completely hold= the + * TLV, and that its length is properly aligned. + * + * Note that the tlv variable points into the migration buffer. To avoid + * a read-overflow, special care is taken to validate the size of the buff= er + * before accessing the contents of the tlv variable. + * + * Return: 0 if there is sufficient space for the entire TLV in the migrat= ion + * buffer. -ENOSPC otherwise. + */ +static int ice_migration_check_tlv_size(struct device *dev, + const struct ice_migration_tlv *tlv, + size_t sz_remaining) +{ + /* Make sure we have enough space for the TLV */ + if (sz_remaining < sizeof(*tlv)) { + dev_dbg(dev, "Not enough space in buffer for TLV header. Need %zu bytes,= but only %zu bytes remain.\n", + sizeof(*tlv), sz_remaining); + return -ENOSPC; + } + + sz_remaining -=3D sizeof(*tlv); + + /* Data lengths must be 4-byte aligned to ensure TLV header positions + * are always 4-byte aligned. + */ + if (tlv->len !=3D ALIGN(tlv->len, 4)) { + dev_dbg(dev, "TLV of type %u has unaligned length of %u bytes\n", + tlv->type, tlv->len); + return -ENOSPC; + } + + if (sz_remaining < tlv->len) { + dev_dbg(dev, "Not enough space in buffer for TLV of type %u, with length= %u. 
Only %zu bytes remain.\n", + tlv->type, tlv->len, sz_remaining); + return -ENOSPC; + } + + return 0; +} + +/** + * ice_migration_validate_tlvs - Validate TLV data integrity and compatibi= lity + * @dev: pointer to device + * @buf: pointer to device state buffer + * @buf_sz: size of buffer + * + * Ensure that the TLV data provided is valid, and matches the expected + * version and format. + * + * Return: 0 for success, negative for error + */ +static int +ice_migration_validate_tlvs(struct device *dev, const void *buf, size_t bu= f_sz) +{ + const struct ice_mig_tlv_header *header; + const struct ice_migration_tlv *tlv; + size_t tlv_size; + int err; + + tlv =3D buf; + + dev_dbg(dev, "Validating TLVs in migration payload of size %zu\n", + buf_sz); + + err =3D ice_migration_check_tlv_size(dev, tlv, buf_sz); + if (err) + return err; + + if (tlv->type !=3D ICE_MIG_TLV_HEADER) { + dev_dbg(dev, "First TLV in migration payload must be the header\n"); + return -EBADMSG; + } + + header =3D (typeof(header))tlv->data; + + if (header->magic !=3D ICE_MIG_MAGIC) { + dev_dbg(dev, "Got magic value 0x%08x, expected 0x%08x\n", + header->magic, ICE_MIG_MAGIC); + return -EPROTONOSUPPORT; + } + + if (header->version !=3D ICE_MIG_VERSION) { + dev_dbg(dev, "Got migration version %d, expected version %d\n", + header->version, ICE_MIG_VERSION); + return -EPROTONOSUPPORT; + } + + /* Validate remaining TLVs */ + do { + /* Move to next TLV */ + tlv_size =3D struct_size(tlv, data, tlv->len); + buf_sz -=3D tlv_size; + tlv =3D (const void *)tlv + tlv_size; + + /* Check buffer for space before dereferencing */ + err =3D ice_migration_check_tlv_size(dev, tlv, buf_sz); + if (err) + return err; + + /* Stop if we reach the end */ + if (tlv->type =3D=3D ICE_MIG_TLV_END) + break; + + if (tlv->type >=3D NUM_ICE_MIG_TLV || + tlv->type >=3D header->num_supported_tlvs) { + dev_dbg(dev, "Unsupported TLV of type %d in migration payload\n", + tlv->type); + return -EPROTONOSUPPORT; + } + + /* TODO: implement other validation? Check for compatibility + * with queue sizes, vector counts, VLAN capabilities, etc? + */ + } while (buf_sz > 0); + + return 0; +} + +/** + * ice_migration_load_devstate - Load device state into the target VF + * @vf_dev: pointer to the VF PCI device + * @buf: pointer to device state buf in migration buffer + * @buf_sz: size of migration buffer + * + * Deserialize the migration buffer TLVs and program the target VF in the + * destination VM to match. + * + * Return: 0 on success, or e negative error code on failure. + */ +int ice_migration_load_devstate(struct pci_dev *vf_dev, const void *buf, + size_t buf_sz) +{ + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + const struct ice_migration_tlv *tlv; + struct ice_vsi *vsi; + struct device *dev; + struct ice_vf *vf; + int err; + + if (!buf) + return -EINVAL; + + if (IS_ERR(pf)) + return PTR_ERR(pf); + + dev =3D ice_pf_to_dev(pf); + + dev_dbg(&vf_dev->dev, "Loading live migration state. 
Migration buffer is = %zu bytes\n", + buf_sz); + + err =3D ice_migration_validate_tlvs(dev, buf, buf_sz); + if (err) + return err; + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(dev, "Unable to locate VF from VF device\n"); + return -EINVAL; + } + + mutex_lock(&vf->cfg_lock); + + vsi =3D ice_get_vf_vsi(vf); + if (!vsi) { + dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id); + err =3D -EINVAL; + goto err_release_cfg_lock; + } + + /* Iterate over TLVs and process migration data */ + tlv =3D buf; + + do { + size_t tlv_size; + + switch (tlv->type) { + case ICE_MIG_TLV_END: + case ICE_MIG_TLV_HEADER: + /* These are already handled above */ + break; + default: + dev_dbg(dev, "Unexpected TLV %d in payload?\n", + tlv->type); + err =3D -EINVAL; + } + + if (err) { + dev_dbg(dev, "Failed to load TLV data for TLV of type %d, err %d\n", + tlv->type, err); + goto err_release_cfg_lock; + } + + tlv_size =3D struct_size(tlv, data, tlv->len); + tlv =3D (const void *)tlv + tlv_size; + } while (tlv->type !=3D ICE_MIG_TLV_END); + + mutex_unlock(&vf->cfg_lock); + + ice_put_vf(vf); + + return 0; + +err_release_cfg_lock: + mutex_unlock(&vf->cfg_lock); + ice_put_vf(vf); + + return err; +} +EXPORT_SYMBOL(ice_migration_load_devstate); diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet= /intel/ice/Makefile index eac45d7c0cf1..e52585af299e 100644 --- a/drivers/net/ethernet/intel/ice/Makefile +++ b/drivers/net/ethernet/intel/ice/Makefile @@ -62,3 +62,4 @@ ice-$(CONFIG_XDP_SOCKETS) +=3D ice_xsk.o ice-$(CONFIG_ICE_SWITCHDEV) +=3D ice_eswitch.o ice_eswitch_br.o ice-$(CONFIG_GNSS) +=3D ice_gnss.o ice-$(CONFIG_ICE_HWMON) +=3D ice_hwmon.o +ice-$(CONFIG_ICE_VFIO_PCI) +=3D virt/migration.o --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A88E264624; Tue, 9 Sep 2025 22:00:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455212; cv=none; b=PN6uvfJrO+nOkMegbyanL/x1uoWpvS38X7vc8bPfKqT4fuxARRRSEM9swF6InagfrKaggKYk528q2V08D1HHushZkRvEhPrwaGMTGGM6QcG/jLBd16i+7v9zyF9r9bjY4yNup5NzPVM7z7Rhh9sW4knq1J5VijEAvcSFDNkwbRM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455212; c=relaxed/simple; bh=vDl7m9tMdR9eYVxyDAaaau41yOwKsEhqf87Oqwp/3Vs=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=ufpZHAytztl5xc5V9gKBzFRGZIqcwZkw2u2AorranwbOrfcqHFEFlZDpoh7jWvkkidBYbm/BuGVfr9MqJkkD/0VA/hxRUjeQXcG7lhvzzpFxCUI51RNw2inwEuVmHGDK/X4babga6WefvYJm8wPrVUj9RF6bNvkjeYOlRaZfM5E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=UC/aiuSV; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="UC/aiuSV" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455211; 
x=1788991211; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=vDl7m9tMdR9eYVxyDAaaau41yOwKsEhqf87Oqwp/3Vs=; b=UC/aiuSVHOC96A0IX40CJZnKFfdmyTT0tzwUEfn4FLdbbf/lPTsje4ul WxfP1QzqkNEHDlX7Iv2R4hZB6we/g8H+Ekoe38b+OQgeMRe+dD4B6qRZq 9KK15/GDHDqHFynh8CCzfwP515dPgzt/R3ky2sAdemlIeaEuRWkFq/H3z Yfr5A6QcnSVToYm00Ziz+pLSeSaK1VsiU3ZrJR3xzhTN/V4+mHxOT7mwf VoKoBx9fZPnpjLKMwGYsG3cJAw58bcxVdQHCDbA9b7Zb4YuMiJUqPgyOP 2u5dx1toHer50ZnHquHIgc7xj8XQvGUkCGjiC2Tne+JgJHV4z0+BxR/Ly Q==; X-CSE-ConnectionGUID: H94IZVThRAGvq7zJCUf/Vg== X-CSE-MsgGUID: 3Eif7NrWRmGtIP3cXzFWIQ== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584631" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584631" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 X-CSE-ConnectionGUID: +taba2B9RWiukFxANaZjEQ== X-CSE-MsgGUID: VmtnZj3ASWK3UeOUV+/xvg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780957" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:07 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:51 -0700 Subject: [PATCH RFC net-next 2/7] ice: implement device suspension for live migration Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-2-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=3863; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=vDl7m9tMdR9eYVxyDAaaau41yOwKsEhqf87Oqwp/3Vs=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi1PPuljKLIvjF4va1mLJW/rK/Ll1r2y/8i3+rlU+t 6q+Wgh0lLIwiHExyIopsig4hKy8bjwhTOuNsxzMHFYmkCEMXJwCMBHzLEaGR9caul6Y+j7a68SX sd3N8dKZv7KztzLqKC/7J+K1+jYXJyPDv4yuiV77lzzT16h2O83WpyQc+LlkuWfsi5vFZYFMC9f wAQA= X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 The ice_migration_suspend_dev() function will be called by the ice_vfio_pci module to suspend the VF device in preparation for migration. It will be called both by the initial host device before transitioning to the STOP_COPY state, as well as by the receiving device prior to loading the migration data. In preparation for STOP_COPY, the device must save some state to fill out a migration buffer payload. In this flow, the save_state parameter is set to true. During the resume flow, the function will not need to save device state, and will set the save_state parameter to false. 
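As a rough illustration, a migration driver such as ice-vfio-pci could use these exports as sketched below when preparing the STOP_COPY data on the source side. The example_stop_copy() wrapper and its buffer handling are hypothetical; only the ice_migration_*() calls come from this series:

  #include <linux/pci.h>
  #include <linux/slab.h>
  #include <linux/net/intel/ice_migration.h>

  /* Hypothetical source-side helper: quiesce the VF, then serialize its
   * state into a freshly allocated buffer for the STOP_COPY file.
   */
  static int example_stop_copy(struct pci_dev *vf_dev, void **out, size_t *out_sz)
  {
      size_t sz;
      void *buf;
      int err;

      /* Stop the VF and record its state into the PF's TLV list */
      err = ice_migration_suspend_dev(vf_dev, true);
      if (err)
          return err;

      sz = ice_migration_get_required_size(vf_dev);
      if (!sz)
          return -ENODATA;

      buf = kvzalloc(sz, GFP_KERNEL_ACCOUNT);
      if (!buf)
          return -ENOMEM;

      err = ice_migration_save_devstate(vf_dev, buf, sz);
      if (err) {
          kvfree(buf);
          return err;
      }

      *out = buf;
      *out_sz = sz;
      return 0;
  }

On the resume side the same suspend call is made with save_state set to false before ice_migration_load_devstate() deserializes the buffer into the target VF.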
Signed-off-by: Jacob Keller --- drivers/net/ethernet/intel/ice/virt/migration.c | 96 +++++++++++++++++++++= +++- 1 file changed, 95 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/virt/migration.c b/drivers/net/= ethernet/intel/ice/virt/migration.c index f13b7674dabd..aa2e17c5be60 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration.c +++ b/drivers/net/ethernet/intel/ice/virt/migration.c @@ -85,9 +85,103 @@ void ice_migration_uninit_dev(struct pci_dev *vf_dev) } EXPORT_SYMBOL(ice_migration_uninit_dev); =20 +/** + * ice_migration_suspend_dev - suspend device + * @vf_dev: pointer to the VF PCI device + * @save_state: true if the device may be preparing for live migration + * + * Suspend the VF device. If save_state is set, first save any state which= is + * necessary for later migration. + * + * Return: 0 for success, negative for error + */ int ice_migration_suspend_dev(struct pci_dev *vf_dev, bool save_state) { - return -EOPNOTSUPP; + struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + struct ice_mig_tlv_entry *entry, *tmp; + struct ice_vsi *vsi; + struct device *dev; + struct ice_vf *vf; + int err; + + if (IS_ERR(pf)) + return PTR_ERR(pf); + + vf =3D ice_get_vf_by_dev(pf, vf_dev); + if (!vf) { + dev_err(&vf_dev->dev, "Unable to locate VF from VF device\n"); + return -EINVAL; + } + + dev =3D ice_pf_to_dev(pf); + + dev_dbg(dev, "Suspending VF %u in preparation for live migration\n", + vf->vf_id); + + mutex_lock(&vf->cfg_lock); + + vsi =3D ice_get_vf_vsi(vf); + if (!vsi) { + dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id); + err =3D -EINVAL; + goto err_release_cfg_lock; + } + + if (save_state) { + if (!list_empty(&vf->mig_tlvs)) { + dev_dbg(dev, "Freeing unused migration TLVs for VF %d\n", + vf->vf_id); + + list_for_each_entry_safe(entry, tmp, &vf->mig_tlvs, + list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + } + } + + /* Prevent VSI from queuing incoming packets by removing all filters */ + ice_fltr_remove_all(vsi); + /* TODO: there's probably a better way to handle this, or it may be + * unnecessary + */ + vf->num_mac =3D 0; + vsi->num_vlan =3D 0; + + /* MAC based filter rule is disabled at this point. Set MAC to zero + * to keep consistency with VF mac address info shown by ip link + */ + eth_zero_addr(vf->hw_lan_addr); + eth_zero_addr(vf->dev_lan_addr); + + err =3D ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, vf->vf_id); + if (err) + dev_warn(dev, "VF %d failed to stop Tx rings. Continuing live migration = regardless.\n", + vf->vf_id); + + err =3D ice_vsi_stop_all_rx_rings(vsi); + if (err) + dev_warn(dev, "VF %d failed to stop Rx rings. 
Continuing live migration = regardless.\n", + vf->vf_id); + + mutex_unlock(&vf->cfg_lock); + ice_put_vf(vf); + + return 0; + +err_free_mig_tlvs: + if (save_state) { + list_for_each_entry_safe(entry, tmp, &vf->mig_tlvs, + list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + } + +err_release_cfg_lock: + mutex_unlock(&vf->cfg_lock); + ice_put_vf(vf); + return err; } EXPORT_SYMBOL(ice_migration_suspend_dev); =20 --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8E440263F44; Tue, 9 Sep 2025 22:00:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455214; cv=none; b=emv8ofwR3dubuK3mvdSAXd0Jr+QfAtmG/I9hhu7EQw3CnaOv+NdGamBZm0qVnNdkAX7XKMrdVIze9bKWyCVqJD3oeyN26LoRaN+NSXTRpgzoL6hgfKlM0HajZccdbhhh6Iz1qiIvh/OZfI2/vUmWpDDYVZpZ5q9e9FOir9mXEt8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455214; c=relaxed/simple; bh=zBu7BSSYtF+/LaaPaw5MmMaobB0x6Gnri9E517yTZ/I=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=iKBklM0cONko0wz+LmFlNbk0THorhOwx5WvSaSNDYd0IOZ8xiOtq+mF1RHSEux0nHpS24LIvCtqJ6TJaTHa8fFf0miD1NFWwuYGHvcb3Qnp2imdPWhGhP1tDwXu3g92R4p3h3B7IvFiSFCJcYcXCNc/3zfAUalIdkwh6V8pPO5g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Ms1PZxwe; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Ms1PZxwe" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455212; x=1788991212; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=zBu7BSSYtF+/LaaPaw5MmMaobB0x6Gnri9E517yTZ/I=; b=Ms1PZxweTexfc+e1OxYd2wA6Q7/ZzWHn4uz0Mnmh/xhgq9WYFq75aLHa cjuuElKAxXvf4owDHETBuCnwSDcvC74eXRER+FD6zCkfvP2ke/RT0m3Mc g6+kH7oMKl1kTjJ5nPsks40nqB2HWTI6swSZLJxcoITnncPsj3Ytf7mX8 Twmx5MdhLbqK6FN7GeWYfwbdMjhRavU4UtpEyX2bK7XwNV9tGZMDsQBtu cr8ikonTmeggx3eXs2LtHTf7pIuGCRNcyhM0vACW+PvImduoJOypcnTYr 23bx/q0qkbuHQDUVShQoyI8DvAaAw+4iaxHKXClAzh1JTJVvStwqDLBRr Q==; X-CSE-ConnectionGUID: MO9BBe4uSq273h2EVY9z6A== X-CSE-MsgGUID: 7IrAjfzASJiK5llPyVSlPg== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584645" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584645" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 X-CSE-ConnectionGUID: a0qT42+XQaWDyK3VKq/wIA== X-CSE-MsgGUID: AKzneJqxQriGs9Z++AywZw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780961" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:07 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 
14:57:52 -0700 Subject: [PATCH RFC net-next 3/7] ice: add migration TLV for basic VF information Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-3-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=14888; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=zBu7BSSYtF+/LaaPaw5MmMaobB0x6Gnri9E517yTZ/I=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi1MFPLfO1mIzW8Xc/DTH88IyxUP/NHeF7Ji9arVsi nls5v+lHaUsDGJcDLJiiiwKDiErrxtPCNN64ywHM4eVCWQIAxenAEykZz7D/4ydH+TE9milfKxo FNp9KURyTmG5Z09swzOeoF26/LbWdQx/+BsSJGd0Noma/77A3x7wdsLMU8cucZxi2NHiUZWh+8S GHwA= X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 Add the ICE_MIG_TLV_VF_INFO TLV type to the migration payload. This TLV contains the basic VF information. This data is special, as it must be loaded first prior to other data from TLVs such as per-queue information. The ice_mig_vf_info structure is the element structure for this TLV. It contains the HW address, virtchnl information, and a variety of other information stored by the host PF. The trickiest detail is the handling for the set of allowed opcodes the VF has negotiated. These are typically stored as a bitmap array, which depends on the size of the VIRTCHNL_OP_MAX. To handle this, make the opcodes_allowlist a flexible array of u32. This is done to avoid encoding the VIRTCHNL_OP_MAX as part of the structure layout. Instead, it is passed as a field in the ice_mig_vf_info structure. The opcodes_allowlist is copied using bitmap_from_arr32 and bitmap_to_arr32 to facilitate conversion from the normal host PF data structures. Care is taken when restoring the VF information to only copy up to the new host's VIRTCHNL_OP_MAX. Additionally, any bits beyond the original host virtchnl_op_max are cleared to prevent the VF from sending these ops. Add logic to save the VF information during suspend. This is done before the host device is stopped, to ensure we save the correct data without any loss due to resets. When restoring the VF information, a pointer to the VF information is located as part of ice_migration_validate_tlvs(). This allows loading the VF information first, regardless of what order the TLVs are serialized. 
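The opcodes_allowlist handling described above can be pictured in isolation as follows. example_save_allowlist() and example_load_allowlist() are hypothetical stand-ins used only to show the conversion pattern, not driver functions; the bitmap helpers and VIRTCHNL_OP_MAX are the existing kernel definitions:

  #include <linux/avf/virtchnl.h>
  #include <linux/bitmap.h>
  #include <linux/minmax.h>

  /* Sender: serialize the allowlist bitmap as a u32 array, and record the
   * largest opcode this host was built with.
   */
  static void example_save_allowlist(const unsigned long *allowlist,
                                     u32 *arr, u32 *sender_op_max)
  {
      *sender_op_max = VIRTCHNL_OP_MAX;
      bitmap_to_arr32(arr, allowlist, VIRTCHNL_OP_MAX);
  }

  /* Receiver: restore only the opcodes both hosts know about, and keep
   * any opcode the sender did not recognize disallowed on the new host.
   */
  static void example_load_allowlist(unsigned long *allowlist,
                                     const u32 *arr, u32 sender_op_max)
  {
      bitmap_from_arr32(allowlist, arr,
                        min_t(u32, VIRTCHNL_OP_MAX, sender_op_max));

      if (sender_op_max < VIRTCHNL_OP_MAX)
          bitmap_clear(allowlist, sender_op_max,
                       VIRTCHNL_OP_MAX - sender_op_max);
  }
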
Signed-off-by: Jacob Keller Reviewed-by: Aleksandr Loktionov --- .../net/ethernet/intel/ice/virt/migration_tlv.h | 56 ++++++ drivers/net/ethernet/intel/ice/virt/migration.c | 206 +++++++++++++++++= +++- 2 files changed, 260 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h b/drivers/= net/ethernet/intel/ice/virt/migration_tlv.h index 2c5b4578060b..f941a6ccfe77 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h +++ b/drivers/net/ethernet/intel/ice/virt/migration_tlv.h @@ -70,6 +70,9 @@ struct ice_migration_tlv { * @ICE_MIG_TLV_HEADER: Header identifying the migration format. Must be t= he * first TLV in the list. * + * @ICE_MIG_TLV_VF_INFO: General configuration of the VF, including data + * exchanged over virtchnl as well as PF host configuration. + * * @NUM_ICE_MIG_TLV: Number of known TLV types. Any type equal to or larger * than this value is unrecognized by this version. * @@ -81,6 +84,7 @@ enum ice_migration_tlvs { /* Do not change the order or add anything between, this is ABI! */ ICE_MIG_TLV_END =3D 0, ICE_MIG_TLV_HEADER, + ICE_MIG_TLV_VF_INFO, =20 /* Add new types above here */ NUM_ICE_MIG_TLV @@ -121,6 +125,57 @@ struct ice_mig_tlv_header { u16 num_supported_tlvs; } __packed; =20 +/** + * struct ice_mig_vf_info - Basic VF information + * @dev_lan_addr: The current device LAN address + * @hw_lan_addr: The HW LAN address + * @driver_caps: Driver capabilities reported by the VF + * @vlan_v2_caps: The VLAN V2 capabilities of the VF + * @vf_ver: The reported virtchnl version of the VF + * @min_tx_rate: The programmed minimum Tx rate of the VF + * @max_tx_rate: The programmed maximum Tx rate of the VF + * @virtchnl_op_max: The largest known virtchnl opcode + * @allowlist_size: The size of the opcodes_allowlist + * @num_vf_qs: The number of queues assigned to the VF + * @num_msix: The number of MSI-X vectors used by the VF + * @port_vlan_tpid: port VLAN TPID + * @port_vlan_vid: port VLAN VID + * @port_vlan_prio: port VLAN priority + * @inner_vlan_strip_ena: True if the inner VLAN stripping is enabled + * @outer_vlan_strip_ena: True if the outer VLAN stripping is enabled + * @pf_set_mac: True if the PF administratively set the MAC address + * @trusted: True of the PF set the trusted VF flag for this VF + * @spoofchk: True if spoof checking is enabled on this VF + * @driver_active: True if the VF driver has initialized over virtchnl. 
+ * @link_forced: True if the link status of this VF is forced + * @link_up: The forced link status, ignored if link_forced is false + * @opcodes_allowlist: The list of currently allowed opcodes as array of u= 32 + */ +struct ice_mig_vf_info { + u8 dev_lan_addr[ETH_ALEN]; + u8 hw_lan_addr[ETH_ALEN]; + u32 driver_caps; + struct virtchnl_vlan_caps vlan_v2_caps; + struct virtchnl_version_info vf_ver; + u32 min_tx_rate; + u32 max_tx_rate; + u32 virtchnl_op_max; + u16 num_vf_qs; + u16 num_msix; + u16 port_vlan_tpid; + u16 port_vlan_vid; + u8 port_vlan_prio; + u8 inner_vlan_strip_ena:1; + u8 outer_vlan_strip_ena:1; + u8 pf_set_mac:1; + u8 trusted:1; + u8 spoofchk:1; + u8 driver_active:1; + u8 link_forced:1; + u8 link_up:1; /* only valid if VF link is forced */ + u32 opcodes_allowlist[]; /* __counted_by(virtchnl_op_max), in bits */ +} __packed; + /** * ice_mig_tlv_type - Convert a TLV type to its number * @p: the TLV structure type @@ -132,6 +187,7 @@ struct ice_mig_tlv_header { #define ice_mig_tlv_type(p) \ _Generic(*(p), \ struct ice_mig_tlv_header : ICE_MIG_TLV_HEADER, \ + struct ice_mig_vf_info : ICE_MIG_TLV_VF_INFO, \ default : ICE_MIG_TLV_END) =20 /** diff --git a/drivers/net/ethernet/intel/ice/virt/migration.c b/drivers/net/= ethernet/intel/ice/virt/migration.c index aa2e17c5be60..67ce5b73a9ce 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration.c +++ b/drivers/net/ethernet/intel/ice/virt/migration.c @@ -85,6 +85,59 @@ void ice_migration_uninit_dev(struct pci_dev *vf_dev) } EXPORT_SYMBOL(ice_migration_uninit_dev); =20 +/** + * ice_migration_save_vf_info - Save VF information during suspend + * @vf: pointer to the VF being migrated + * @vsi: pointer to the VSI for this VF + * + * Save the VF device information when suspending a VF for live migration. + * + * Return: 0 on success, negative error code on failure. + */ +static int ice_migration_save_vf_info(struct ice_vf *vf, struct ice_vsi *v= si) +{ + struct ice_mig_vf_info *vf_info; + + lockdep_assert_held(&vf->cfg_lock); + + vf_info =3D ice_mig_alloc_flex_tlv(vf_info, opcodes_allowlist, + BITS_TO_U32(VIRTCHNL_OP_MAX)); + if (!vf_info) + return -ENOMEM; + + vf_info->driver_caps =3D vf->driver_caps; + vf_info->port_vlan_tpid =3D vf->port_vlan_info.tpid; + vf_info->port_vlan_vid =3D vf->port_vlan_info.vid; + vf_info->port_vlan_prio =3D vf->port_vlan_info.prio; + vf_info->vlan_v2_caps =3D vf->vlan_v2_caps; + vf_info->vf_ver =3D vf->vf_ver; + vf_info->min_tx_rate =3D vf->min_tx_rate; + vf_info->max_tx_rate =3D vf->max_tx_rate; + vf_info->num_vf_qs =3D vf->num_vf_qs; + vf_info->num_msix =3D vf->num_msix; + vf_info->inner_vlan_strip_ena =3D + vf->vlan_strip_ena & ICE_INNER_VLAN_STRIP_ENA ? 1 : 0; + vf_info->outer_vlan_strip_ena =3D + vf->vlan_strip_ena & ICE_OUTER_VLAN_STRIP_ENA ? 
1 : 0; + vf_info->pf_set_mac =3D vf->pf_set_mac; + vf_info->trusted =3D vf->trusted; + vf_info->spoofchk =3D vf->spoofchk; + vf_info->link_forced =3D vf->link_forced; + vf_info->link_up =3D vf->link_up; + vf_info->driver_active =3D test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states); + + ether_addr_copy(vf_info->dev_lan_addr, vf->dev_lan_addr); + ether_addr_copy(vf_info->hw_lan_addr, vf->hw_lan_addr); + + vf_info->virtchnl_op_max =3D VIRTCHNL_OP_MAX; + bitmap_to_arr32(vf_info->opcodes_allowlist, vf->opcodes_allowlist, + VIRTCHNL_OP_MAX); + + ice_mig_tlv_add_tail(vf_info, &vf->mig_tlvs); + + return 0; +} + /** * ice_migration_suspend_dev - suspend device * @vf_dev: pointer to the VF PCI device @@ -138,6 +191,10 @@ int ice_migration_suspend_dev(struct pci_dev *vf_dev, = bool save_state) kfree(entry); } } + + err =3D ice_migration_save_vf_info(vf, vsi); + if (err) + goto err_free_mig_tlvs; } =20 /* Prevent VSI from queuing incoming packets by removing all filters */ @@ -434,6 +491,7 @@ static int ice_migration_check_tlv_size(struct device *= dev, * @dev: pointer to device * @buf: pointer to device state buffer * @buf_sz: size of buffer + * @vf_info: on return, pointer to the VF info TLV * * Ensure that the TLV data provided is valid, and matches the expected * version and format. @@ -441,7 +499,8 @@ static int ice_migration_check_tlv_size(struct device *= dev, * Return: 0 for success, negative for error */ static int -ice_migration_validate_tlvs(struct device *dev, const void *buf, size_t bu= f_sz) +ice_migration_validate_tlvs(struct device *dev, const void *buf, size_t bu= f_sz, + const struct ice_mig_vf_info **vf_info) { const struct ice_mig_tlv_header *header; const struct ice_migration_tlv *tlv; @@ -476,6 +535,8 @@ ice_migration_validate_tlvs(struct device *dev, const v= oid *buf, size_t buf_sz) return -EPROTONOSUPPORT; } =20 + *vf_info =3D NULL; + /* Validate remaining TLVs */ do { /* Move to next TLV */ @@ -502,8 +563,140 @@ ice_migration_validate_tlvs(struct device *dev, const= void *buf, size_t buf_sz) /* TODO: implement other validation? Check for compatibility * with queue sizes, vector counts, VLAN capabilities, etc? */ + + /* Save the VF info pointer, as we must process it first */ + if (tlv->type =3D=3D ICE_MIG_TLV_VF_INFO) + *vf_info =3D (typeof(*vf_info))tlv->data; + } while (buf_sz > 0); =20 + if (!*vf_info) { + dev_dbg(dev, "Missing VF information TLV in migration payload\n"); + return -EINVAL; + } + + return 0; +} + +/** + * ice_migration_load_vf_info - Load VF information from migration buffer + * @vf: pointer to the VF being migrated to + * @vsi: the VSI for this VF + * @vf_info: VF information from the migration buffer + * + * Load the VF information from the migration buffer, preparing the VF to + * complete migration. + * + * Return: 0 on success, or a negative error code on failure. + */ +static int ice_migration_load_vf_info(struct ice_vf *vf, struct ice_vsi *v= si, + const struct ice_mig_vf_info *vf_info) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + int err; + + lockdep_assert_held(&vf->cfg_lock); + + dev_dbg(dev, "Loading general VF configuration for VF %u\n", + vf->vf_id); + + dev_dbg(dev, "VF %d had %u MSI-X vectors. Requesting %u vectors\n", + vf->vf_id, vf->num_msix, vf_info->num_msix); + + /* Change the number of MSI-X vectors first */ + // TODO: ice_sriov_set_msix_vec_count sets the MSI-X to 1 more than + // the value passed in. This should be fixed. 
+ err =3D ice_sriov_set_msix_vec_count(vf->vfdev, vf_info->num_msix - ICE_N= ONQ_VECS_VF); + if (err) { + dev_dbg(dev, "Unable to reconfigure MSI-X vectors, err %d\n", + err); + return err; + } + + /* Set values which are configured by VF reset */ + vf->trusted =3D vf_info->trusted; + vf->num_req_qs =3D vf_info->num_vf_qs; + vf->port_vlan_info.tpid =3D vf_info->port_vlan_tpid; + vf->port_vlan_info.vid =3D vf_info->port_vlan_vid; + vf->port_vlan_info.prio =3D vf_info->port_vlan_prio; + vf->min_tx_rate =3D vf_info->min_tx_rate; + vf->max_tx_rate =3D vf_info->max_tx_rate; + vf->spoofchk =3D vf_info->spoofchk; + + ether_addr_copy(vf->dev_lan_addr, vf_info->dev_lan_addr); + ether_addr_copy(vf->hw_lan_addr, vf_info->hw_lan_addr); + + /* Reset the VF */ + ice_reset_vf(vf, 0); + + /* Configure the rest of the settings */ + vf->vlan_v2_caps =3D vf_info->vlan_v2_caps; + vf->vf_ver =3D vf_info->vf_ver; + vf->driver_caps =3D vf_info->driver_caps; + + if (vf_info->inner_vlan_strip_ena) { + err =3D vsi->inner_vlan_ops.ena_stripping(vsi, ETH_P_8021Q); + if (err) { + dev_dbg(dev, "Failed to enable inner VLAN stripping, err %d\n", + err); + return err; + } + vf->vlan_strip_ena |=3D ICE_INNER_VLAN_STRIP_ENA; + } else { + err =3D vsi->inner_vlan_ops.dis_stripping(vsi); + if (err) { + dev_dbg(dev, "Failed to enable inner VLAN stripping, err %d\n", + err); + return err; + } + vf->vlan_strip_ena &=3D ~ICE_INNER_VLAN_STRIP_ENA; + } + + if (vf_info->outer_vlan_strip_ena) { + enum ice_l2tsel l2tsel =3D + ICE_L2TSEL_EXTRACT_FIRST_TAG_L2TAG2_2ND; + + err =3D vsi->outer_vlan_ops.ena_stripping(vsi, ETH_P_8021Q); + if (err) { + dev_dbg(dev, "Failed to enable outer VLAN stripping, err %d\n", + err); + return err; + } + ice_vsi_update_l2tsel(vsi, l2tsel); + vf->vlan_strip_ena |=3D ICE_OUTER_VLAN_STRIP_ENA; + } else { + enum ice_l2tsel l2tsel =3D + ICE_L2TSEL_EXTRACT_FIRST_TAG_L2TAG1; + + err =3D vsi->outer_vlan_ops.dis_stripping(vsi); + if (err) { + dev_dbg(dev, "Failed to enable outer VLAN stripping, err %d\n", + err); + return err; + } + ice_vsi_update_l2tsel(vsi, l2tsel); + vf->vlan_strip_ena &=3D ~ICE_OUTER_VLAN_STRIP_ENA; + } + + vf->pf_set_mac =3D vf_info->pf_set_mac; + vf->link_forced =3D vf_info->link_forced; + vf->link_up =3D vf_info->link_up; + + /* TODO: should we just enforce that virtchnl_op_max matches + * VIRTCHNL_OP_MAX? + */ + bitmap_from_arr32(vf->opcodes_allowlist, vf_info->opcodes_allowlist, + min(VIRTCHNL_OP_MAX, vf_info->virtchnl_op_max)); + + /* Disallow any ops the original VF didn't recognize */ + if (vf_info->virtchnl_op_max < VIRTCHNL_OP_MAX) + bitmap_clear(vf->opcodes_allowlist, + vf_info->virtchnl_op_max, + VIRTCHNL_OP_MAX - vf_info->virtchnl_op_max); + + if (vf_info->driver_active) + set_bit(ICE_VF_STATE_ACTIVE, vf->vf_states); + return 0; } =20 @@ -522,6 +715,7 @@ int ice_migration_load_devstate(struct pci_dev *vf_dev,= const void *buf, size_t buf_sz) { struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); + const struct ice_mig_vf_info *vf_info; const struct ice_migration_tlv *tlv; struct ice_vsi *vsi; struct device *dev; @@ -539,7 +733,7 @@ int ice_migration_load_devstate(struct pci_dev *vf_dev,= const void *buf, dev_dbg(&vf_dev->dev, "Loading live migration state. 
Migration buffer is = %zu bytes\n", buf_sz); =20 - err =3D ice_migration_validate_tlvs(dev, buf, buf_sz); + err =3D ice_migration_validate_tlvs(dev, buf, buf_sz, &vf_info); if (err) return err; =20 @@ -558,6 +752,13 @@ int ice_migration_load_devstate(struct pci_dev *vf_dev= , const void *buf, goto err_release_cfg_lock; } =20 + err =3D ice_migration_load_vf_info(vf, vsi, vf_info); + if (err) { + dev_dbg(dev, "Failed to load initial VF information, err %d\n", + err); + goto err_release_cfg_lock; + } + /* Iterate over TLVs and process migration data */ tlv =3D buf; =20 @@ -567,6 +768,7 @@ int ice_migration_load_devstate(struct pci_dev *vf_dev,= const void *buf, switch (tlv->type) { case ICE_MIG_TLV_END: case ICE_MIG_TLV_HEADER: + case ICE_MIG_TLV_VF_INFO: /* These are already handled above */ break; default: --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AF22A265CC5; Tue, 9 Sep 2025 22:00:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455216; cv=none; b=twmClKqCAVi+cl/kNwWqzZj27JCTTCgopyQjNfy4i/vaTqjXP3sAe5QG7qD6eBm1Pfbt6F2+W9WOBrjvbVH4+SU+r5cegDeg3w6fNu5QefEWwtIwjqQHxPyxw8Eh6G2RALObjmu0Ke6nVgu/nJYiHZlfOTPmdl0qQAUieoyD2zc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455216; c=relaxed/simple; bh=vD7Jqj4tEkAvBNRNkbJh/eYq0Xqv+TOhB2jpGCRhIyQ=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=nh3srQdzKJcH+XkCb4PJLaqbzMHQOw22iDGCpK5qZMbK3eY1a2xfuiypDUoCMcdrZT0wQSkJFv+U1mipG58ic/DFrMUvEaKUBI7kIltCUVocHxmpSr9Aku63Frd50G4AVUcCk8s44xQRbiY9wHhUxGrcPZdUjxf+BreY9j9uTSQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=nbyd3oIz; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="nbyd3oIz" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455214; x=1788991214; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=vD7Jqj4tEkAvBNRNkbJh/eYq0Xqv+TOhB2jpGCRhIyQ=; b=nbyd3oIzSJXpQtWxBT6NqrjadQl0sZtH8XhYwcWQl652P0lnWyy10N+q h/VcKEYRU/IQpEu/URJhs4Gw0wjwiv61mvRAUCd6UP78XZNV7zhUv1+Re p20vrRwxNOHFsxV/z0fp0eu4X+/g5isd7En632rOCMLubEmjubPhi+41I lX5EM8P7iMSBzKqDfcVLclfr0sMPl1XwoCxopdnbYQaTPdxDU3EN4vySV G+p+6g3Yu0eltu3E5EKcrIlZiVzLxtc7bDkIjEb0ublowmqiJIAISQLI+ g+3BjVaC+VArEst0O1aggQI+Izh86cswN/eDcDCG3N03uMQPBIUi2W4f8 w==; X-CSE-ConnectionGUID: //3VNJTWT1SqTGhKmHgG7A== X-CSE-MsgGUID: Iz+DImYLQz6Fsqo/3AMfhw== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584661" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584661" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:09 -0700 
X-CSE-ConnectionGUID: tp1wuVL4QZaMPtHG8XGqtw== X-CSE-MsgGUID: SZcHgABNS0ewj9+9yHwNxA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780964" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:53 -0700 Subject: [PATCH RFC net-next 4/7] ice: add migration TLVs for queue and interrupt state Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-4-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=31648; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=vD7Jqj4tEkAvBNRNkbJh/eYq0Xqv+TOhB2jpGCRhIyQ=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi1OP/zGP4EyJeca+2y009GP1CbeFVYmGn94e5ol5s a9nmqpfRykLgxgXg6yYIouCQ8jK68YTwrTeOMvBzGFlAhnCwMUpABMJus/wP/NX1LUUsZppn9M3 W3E+u8f+bY8Iz9VPOVMzuLYZLn60OY3hv2vrhr92k7d9bYzV3X9632ONnzGygV+WzJwnNiNutyj XNxYA X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 Add the ICE_MIG_TLV_TX_QUEUE, ICE_MIG_TLV_RX_QUEUE, and ICE_MIG_TLV_MSIX_REGS TLVs to the migration payload buffer. These TLVs are included in the payload once per queue or interrupt vector, respectively. Update the PF driver to ensure that the QTX_COMM_HEAD register is initialized to a dummy QTX_COMM_HEAD_HEAD_M value, which is necessary to allow identifying the correct Tx head value. Replay of the Tx queue data is tricky, as we need to get the Tx queue of the target state to have the correct head. This is necessary as we must ensure the real Tx queue head matches the resident head value in the VM driver memory. Prior to migration start, the VF is reset which has the queue set to a head and tail of zero. To correct the head position, the queues are re-assigned to the PF, then dummy packets are inserted into the queue until the positions are correct. Finally, the queues are assigned back to the VF. This gets the head positions correct without invalid DMA access. DMA for these temporary dummy descriptors is allocated once in the ice_migration_load_devstate, in order to avoid needless re-allocation of DMA for each queue. 
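To make the head recovery rule concrete, the following stand-alone sketch
mirrors the decode performed by ice_migration_save_tx_queues(): the value
read from QTX_COMM_HEAD is either the marker written at queue enable (no
write back has happened yet) or exactly one behind the real on-die head.
This is an illustration only, not driver code; the 13-bit field mask value
is an assumption based on the reserved-range comment added to
ice_vc_ena_qs_msg() in this patch.

  /* Illustrative decode of QTX_COMM_HEAD into the true Tx ring head.
   * QTX_COMM_HEAD_HEAD_M is assumed to be the 13-bit head field mask,
   * which doubles as the "no write back since enable" marker value.
   */
  #include <stdint.h>

  #define QTX_COMM_HEAD_HEAD_M 0x1FFFu

  static uint16_t ice_mig_decode_tx_head(uint32_t qtx_comm_head,
                                         uint16_t ring_count)
  {
          uint16_t head = qtx_comm_head & QTX_COMM_HEAD_HEAD_M;

          if (head == QTX_COMM_HEAD_HEAD_M)
                  return 0;       /* marker: nothing sent since enable */
          if (head == ring_count - 1)
                  return 0;       /* one-behind value wraps back to 0 */
          return head + 1;        /* real head is one past the register */
  }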
Signed-off-by: Jacob Keller Reviewed-by: Aleksandr Loktionov --- .../net/ethernet/intel/ice/virt/migration_tlv.h | 85 +++ drivers/net/ethernet/intel/ice/virt/migration.c | 729 +++++++++++++++++= ++++ drivers/net/ethernet/intel/ice/virt/queues.c | 21 + 3 files changed, 835 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h b/drivers/= net/ethernet/intel/ice/virt/migration_tlv.h index f941a6ccfe77..3e10e53868b2 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h +++ b/drivers/net/ethernet/intel/ice/virt/migration_tlv.h @@ -73,6 +73,15 @@ struct ice_migration_tlv { * @ICE_MIG_TLV_VF_INFO: General configuration of the VF, including data * exchanged over virtchnl as well as PF host configuration. * + * @ICE_MIG_TLV_TX_QUEUE: Configuration for a Tx queue. Appears once per Tx + * queue. + * + * @ICE_MIG_TLV_RX_QUEUE: Configuration for an Rx queue. Appears once per = Rx + * queue. + * + * @ICE_MIG_TLV_MSIX_REGS: MSI-X register data for the VF. Appears once per + * MSI-X interrupt, including the miscellaneous interrupt for the mailbox. + * * @NUM_ICE_MIG_TLV: Number of known TLV types. Any type equal to or larger * than this value is unrecognized by this version. * @@ -85,6 +94,9 @@ enum ice_migration_tlvs { ICE_MIG_TLV_END =3D 0, ICE_MIG_TLV_HEADER, ICE_MIG_TLV_VF_INFO, + ICE_MIG_TLV_TX_QUEUE, + ICE_MIG_TLV_RX_QUEUE, + ICE_MIG_TLV_MSIX_REGS, =20 /* Add new types above here */ NUM_ICE_MIG_TLV @@ -176,6 +188,76 @@ struct ice_mig_vf_info { u32 opcodes_allowlist[]; /* __counted_by(virtchnl_op_max), in bits */ } __packed; =20 +/** + * struct ice_mig_tx_queue - Data to migrate a VF Tx queue + * @dma: the base DMA address for the queue + * @count: size of the Tx ring + * @head: the current head position of the Tx ring + * @queue_id: the VF relative Tx queue ID + * @vector_id: the VF relative MSI-X vector associated with this queue + * @vector_valid: if true, an MSI-X vector is associated with this queue + * @ena: if true, the Tx queue is currently enabled, false otherwise + * @reserved: reservied bitfield which must be zero + */ +struct ice_mig_tx_queue { + u64 dma; + u16 count; + u16 head; + u16 queue_id; + u16 vector_id; + u8 vector_valid:1; + u8 ena:1; + u8 reserved:6; +} __packed; + +/** + * struct ice_mig_rx_queue - Data to migrate a VF Rx queue + * @dma: the base DMA address for the queue + * @max_frame: the maximum frame size of the queue + * @rx_buf_len: the length of the Rx buffers associated with the ring + * @rxdid: the Rx descriptor format of the ring + * @count: the size of the Rx ring + * @head: the current head position of the ring + * @tail: the current tail position of the ring + * @queue_id: the VF relative Rx queue ID + * @vector_id: the VF relative MSI-X vector associated with this queue + * @vector_valid: if true, an MSI-X vector is associated with this queue + * @crc_strip: if true, CRC stripping is enabled, false otherwise + * @ena: if true, the Rx queue is currently enabled, false otherwise + * @reserved: reserved bitfield which must be zero + */ +struct ice_mig_rx_queue { + u64 dma; + u16 max_frame; + u16 rx_buf_len; + u32 rxdid; + u16 count; + u16 head; + u16 tail; + u16 queue_id; + u16 vector_id; + u8 vector_valid:1; + u8 crc_strip:1; + u8 ena:1; + u8 reserved:5; +} __packed; + +/** + * struct ice_mig_msix_regs - MSI-X register data for migrating VF + * @int_dyn_ctl: Contents GLINT_DYN_CTL for this vector + * @int_intr: Contents of GLINT_ITR for all ITRs of this vector + * @tx_itr_idx: The ITR index used for transmit + * @rx_itr_idx: The 
ITR index used for receive + * @vector_id: The MSI-X vector, *including* the miscellaneous non-queue v= ector + */ +struct ice_mig_msix_regs { + u32 int_dyn_ctl; + u32 int_intr[ICE_MIG_VF_ITR_NUM]; + u16 tx_itr_idx; + u16 rx_itr_idx; + u16 vector_id; +} __packed; + /** * ice_mig_tlv_type - Convert a TLV type to its number * @p: the TLV structure type @@ -188,6 +270,9 @@ struct ice_mig_vf_info { _Generic(*(p), \ struct ice_mig_tlv_header : ICE_MIG_TLV_HEADER, \ struct ice_mig_vf_info : ICE_MIG_TLV_VF_INFO, \ + struct ice_mig_tx_queue : ICE_MIG_TLV_TX_QUEUE, \ + struct ice_mig_rx_queue : ICE_MIG_TLV_RX_QUEUE, \ + struct ice_mig_msix_regs : ICE_MIG_TLV_MSIX_REGS, \ default : ICE_MIG_TLV_END) =20 /** diff --git a/drivers/net/ethernet/intel/ice/virt/migration.c b/drivers/net/= ethernet/intel/ice/virt/migration.c index 67ce5b73a9ce..e59eb99b20da 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration.c +++ b/drivers/net/ethernet/intel/ice/virt/migration.c @@ -138,6 +138,269 @@ static int ice_migration_save_vf_info(struct ice_vf *= vf, struct ice_vsi *vsi) return 0; } =20 +/** + * ice_migration_save_tx_queues - Save Tx queue state + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save Tx queue state in preparation for live migration. + * + * Return: 0 for success, negative for error + */ +static int ice_migration_save_tx_queues(struct ice_vf *vf, struct ice_vsi = *vsi) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_mig_tlv_entry *entry, *tmp; + struct list_head queue_tlvs; + int err, i; + + lockdep_assert_held(&vf->cfg_lock); + INIT_LIST_HEAD(&queue_tlvs); + + dev_dbg(dev, "Saving Tx queue config for VF %u\n", + vf->vf_id); + + ice_for_each_txq(vsi, i) { + struct ice_tx_ring *tx_ring =3D vsi->tx_rings[i]; + struct ice_mig_tx_queue *tx_queue; + struct ice_tlan_ctx tlan_ctx =3D {}; + struct ice_hw *hw =3D &vf->pf->hw; + u32 qtx_comm_head; + u16 tx_head; + int err; + + if (!tx_ring) + continue; + + /* Ignore queues which were never configured by the VF */ + if (!tx_ring->dma) { + dev_dbg(dev, "Ignoring unconfigured Tx queue %d on VF %d with NULL DMA = address\n", + i, vf->vf_id); + continue; + } + + tx_queue =3D ice_mig_alloc_tlv(tx_queue); + if (!tx_queue) { + err =3D -ENOMEM; + goto err_free_tlv_entries; + } + + err =3D ice_read_txq_ctx(hw, &tlan_ctx, tx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to read TXQ[%d] context, err=3D%d\n", + tx_ring->q_index, err); + goto err_free_tlv_entries; + } + + qtx_comm_head =3D rd32(hw, QTX_COMM_HEAD(tx_ring->reg_idx)); + tx_head =3D FIELD_GET(QTX_COMM_HEAD_HEAD_M, qtx_comm_head); + + /* Determine the Tx head from the QTX_COMM_HEAD register. + * + * If no write back has happened since the queue was enabled, + * the register will read as QTX_COMM_HEAD_HEAD_M. + * + * Otherwise, the value from QTX_COMM_HEAD will be precisely + * one behind the real Tx head value. + */ + if (tx_head =3D=3D QTX_COMM_HEAD_HEAD_M || + tx_head =3D=3D tx_ring->count - 1) + tx_head =3D 0; + else + tx_head++; + + tx_queue->queue_id =3D i; + tx_queue->dma =3D tx_ring->dma; + tx_queue->count =3D tx_ring->count; + tx_queue->head =3D tx_head; + if (tx_ring->q_vector) { + /* we don't need to account for ICE_NONQ_VECS_VF here, + * as the deserializing end won't expect it. 
+ */ + tx_queue->vector_id =3D tx_ring->q_vector->v_idx; + tx_queue->vector_valid =3D 1; + } + tx_queue->ena =3D test_bit(i, vf->txq_ena); + + ice_mig_tlv_add_tail(tx_queue, &queue_tlvs); + } + + list_splice_tail(&queue_tlvs, &vf->mig_tlvs); + + return 0; + +err_free_tlv_entries: + list_for_each_entry_safe(entry, tmp, &queue_tlvs, list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + + return err; +} + +/** + * ice_migration_save_rx_queues - Save Rx queue state + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save Rx queue state in preparation for live migration. + * + * Return: 0 for success, negative for error + */ +static int ice_migration_save_rx_queues(struct ice_vf *vf, struct ice_vsi = *vsi) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_mig_tlv_entry *entry, *tmp; + struct list_head queue_tlvs; + int err, i; + + lockdep_assert_held(&vf->cfg_lock); + INIT_LIST_HEAD(&queue_tlvs); + + dev_dbg(dev, "Saving Rx queue config for VF %u\n", + vf->vf_id); + + ice_for_each_rxq(vsi, i) { + struct ice_rx_ring *rx_ring =3D vsi->rx_rings[i]; + struct ice_mig_rx_queue *rx_queue; + struct ice_rlan_ctx rlan_ctx =3D {}; + struct ice_hw *hw =3D &vf->pf->hw; + u32 rxflxp; + int err; + + if (!rx_ring) + continue; + + /* Ignore queues which were never configured by the VF */ + if (!rx_ring->dma) { + dev_dbg(dev, "Ignoring unconfigured Rx queue %d on VF %d with NULL DMA = address\n", + i, vf->vf_id); + continue; + } + + rx_queue =3D ice_mig_alloc_tlv(rx_queue); + if (!rx_queue) { + err =3D -ENOMEM; + goto err_free_tlv_entries; + } + + err =3D ice_read_rxq_ctx(hw, &rlan_ctx, rx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to read RXQ[%d] context, err=3D%d\n", + rx_ring->q_index, err); + goto err_free_tlv_entries; + } + + rxflxp =3D rd32(hw, QRXFLXP_CNTXT(rx_ring->reg_idx)); + + rx_queue->queue_id =3D i; + rx_queue->head =3D rlan_ctx.head; + rx_queue->tail =3D QRX_TAIL(rx_ring->reg_idx); + rx_queue->dma =3D rx_ring->dma; + rx_queue->max_frame =3D rlan_ctx.rxmax; + rx_queue->rx_buf_len =3D rx_ring->rx_buf_len; + rx_queue->rxdid =3D FIELD_GET(QRXFLXP_CNTXT_RXDID_IDX_M, rxflxp); + rx_queue->count =3D rx_ring->count; + if (rx_ring->q_vector) { + /* we don't need to account for ICE_NONQ_VECS_VF here, + * as the deserializing end won't expect it. + */ + rx_queue->vector_id =3D rx_ring->q_vector->v_idx; + rx_queue->vector_valid =3D 1; + } + rx_queue->crc_strip =3D rlan_ctx.crcstrip; + rx_queue->ena =3D test_bit(i, vf->rxq_ena); + + ice_mig_tlv_add_tail(rx_queue, &queue_tlvs); + } + + list_splice_tail(&queue_tlvs, &vf->mig_tlvs); + + return 0; + +err_free_tlv_entries: + list_for_each_entry_safe(entry, tmp, &queue_tlvs, list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + + return err; +} + +/** + * ice_migration_save_msix_regs - Save MSI-X registers during suspend + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save the MMIO registers associated with MSI-X interrupts, including the + * miscellaneous interrupt used for the mailbox. Called during suspend to = save + * the values prior to queue shutdown, to ensure they match the VF suspend= ed + * state accurately. + * + * Return: 0 on success, negative error code on failure. 
+ */ +static int ice_migration_save_msix_regs(struct ice_vf *vf, + struct ice_vsi *vsi) +{ + struct ice_mig_tlv_entry *entry, *tmp; + struct ice_hw *hw =3D &vf->pf->hw; + struct list_head msix_tlvs; + int err; + + lockdep_assert_held(&vf->cfg_lock); + INIT_LIST_HEAD(&msix_tlvs); + + /* Copy the IRQ registers, starting with the non-queue vectors */ + for (int idx =3D 0; idx < vsi->num_q_vectors + ICE_NONQ_VECS_VF; idx++) { + struct ice_mig_msix_regs *msix_regs; + u16 reg_idx, tx_itr_idx, rx_itr_idx; + + if (idx < ICE_NONQ_VECS_VF) { + reg_idx =3D vf->first_vector_idx + idx; + tx_itr_idx =3D 0; + rx_itr_idx =3D 0; + } else { + struct ice_q_vector *q_vector; + int v_id; + + v_id =3D idx - ICE_NONQ_VECS_VF; + q_vector =3D vsi->q_vectors[v_id]; + reg_idx =3D q_vector->reg_idx; + tx_itr_idx =3D q_vector->tx.itr_idx; + rx_itr_idx =3D q_vector->rx.itr_idx; + } + + msix_regs =3D ice_mig_alloc_tlv(msix_regs); + if (!msix_regs) { + err =3D -ENOMEM; + goto err_free_tlv_entries; + } + + msix_regs->vector_id =3D idx; + msix_regs->tx_itr_idx =3D tx_itr_idx; + msix_regs->rx_itr_idx =3D rx_itr_idx; + + msix_regs->int_dyn_ctl =3D rd32(hw, GLINT_DYN_CTL(reg_idx)); + for (int itr =3D 0; itr < ICE_MIG_VF_ITR_NUM; itr++) + msix_regs->int_intr[itr] =3D + rd32(hw, GLINT_ITR(itr, reg_idx)); + + ice_mig_tlv_add_tail(msix_regs, &msix_tlvs); + } + + list_splice_tail(&msix_tlvs, &vf->mig_tlvs); + + return 0; + +err_free_tlv_entries: + list_for_each_entry_safe(entry, tmp, &msix_tlvs, list_entry) { + list_del(&entry->list_entry); + kfree(entry); + } + + return err; +} + /** * ice_migration_suspend_dev - suspend device * @vf_dev: pointer to the VF PCI device @@ -195,6 +458,11 @@ int ice_migration_suspend_dev(struct pci_dev *vf_dev, = bool save_state) err =3D ice_migration_save_vf_info(vf, vsi); if (err) goto err_free_mig_tlvs; + + err =3D ice_migration_save_msix_regs(vf, vsi); + if (err) + goto err_free_mig_tlvs; + } =20 /* Prevent VSI from queuing incoming packets by removing all filters */ @@ -221,6 +489,17 @@ int ice_migration_suspend_dev(struct pci_dev *vf_dev, = bool save_state) dev_warn(dev, "VF %d failed to stop Rx rings. Continuing live migration = regardless.\n", vf->vf_id); =20 + if (save_state) { + /* Save queue state after stopping the queues */ + err =3D ice_migration_save_rx_queues(vf, vsi); + if (err) + goto err_free_mig_tlvs; + + err =3D ice_migration_save_tx_queues(vf, vsi); + if (err) + goto err_free_mig_tlvs; + } + mutex_unlock(&vf->cfg_lock); ice_put_vf(vf); =20 @@ -700,6 +979,424 @@ static int ice_migration_load_vf_info(struct ice_vf *= vf, struct ice_vsi *vsi, return 0; } =20 +/** + * ice_migration_init_dummy_desc - Initialize DMA for the dummy descriptors + * @tx_desc: Tx ring descriptor array + * @len: length of the descriptor array + * @tx_pkt_dma: dummy packet DMA memory + * + * Initialize the dummy ring data descriptors using the provided DMA for + * packet data memory. + */ +static void ice_migration_init_dummy_desc(struct ice_tx_desc *tx_desc, + u16 len, dma_addr_t tx_pkt_dma) +{ + for (int i =3D 0; i < len; i++) { + u32 td_cmd; + + td_cmd =3D ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY; + tx_desc[i].cmd_type_offset_bsz =3D + ice_build_ctob(td_cmd, 0, SZ_256, 0); + tx_desc[i].buf_addr =3D cpu_to_le64(tx_pkt_dma); + } +} + +/** + * ice_migration_wait_for_tx_completion - Wait for Tx transmission complet= ion + * @hw: pointer to the device HW structure + * @tx_ring: Tx ring structure + * @head: target Tx head position + * + * Wait for hardware to complete updating the Tx ring head. 
We read this v= alue + * from QTX_COMM_HEAD. This will either be the initially programmed + * QTX_COMM_HEAD_HEAD_M marker value, or one before the actual head of the= Tx + * ring. + * + * Since we only inject packets when the head needs to move from zero, the + * target head position will always be non-zero. + * + * Return: 0 for success, negative for error. + */ +static int +ice_migration_wait_for_tx_completion(struct ice_hw *hw, + struct ice_tx_ring *tx_ring, u16 head) +{ + u32 tx_head; + int err; + + err =3D rd32_poll_timeout(hw, QTX_COMM_HEAD(tx_ring->reg_idx), + tx_head, + FIELD_GET(QTX_COMM_HEAD_HEAD_M, tx_head) =3D=3D head - 1, + 10, 500); + if (err) { + dev_dbg(ice_hw_to_dev(hw), "Timed out waiting for Tx ring completion, ta= rget head %u, qtx_comm_head %u, err %d\n", + head, tx_head, err); + return err; + } + + return 0; +} + +/** + * ice_migration_inject_dummy_desc - Inject dummy descriptors to move Tx h= ead + * @vf: pointer to the VF being migrated to + * @tx_ring: Tx ring instance + * @head: Tx head to be loaded + * @tx_desc_dma: Tx descriptor ring base DMG address + * + * Load the Tx head for the given Tx ring using the following steps: + * + * 1. Initialize QTX_COMM_HEAD to marker value. + * 2. Backup the current Tx context. + * 3. Temporarily update the Tx context to point to the PF space, using the + * provided PF Tx descriptor DMA, filled with dummy descriptors and pac= ket + * data. + * 4. Disable the Tx queue interrupt. + * 5. Bump the Tx ring doorbell to the desired Tx head position. + * 6. Wait for hardware to DMA and update Tx head. + * and update the Tx head. + * 7. Restore the backed up Tx queue context. + * 8. Re-enable the Tx queue interrupt. + * + * By updating the queue context to point to the PF space with the PF-mana= ged + * DMA address, the HW will issue PCI upstream memory transactions tagged = by + * the PF BDF. This will work successfully to update the Tx head without + * needing to interact with the VF DMA. + * + * Return: 0 for success, negative for error. + */ +static int +ice_migration_inject_dummy_desc(struct ice_vf *vf, struct ice_tx_ring *tx_= ring, + u16 head, dma_addr_t tx_desc_dma) +{ + struct ice_tlan_ctx tlan_ctx, tlan_ctx_orig; + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_hw *hw =3D &vf->pf->hw; + u32 dynctl; + u32 tqctl; + int err; + + /* 1. Initialize head after re-programming the queue */ + wr32(hw, QTX_COMM_HEAD(tx_ring->reg_idx), QTX_COMM_HEAD_HEAD_M); + + /* 2. Backup Tx Queue context */ + err =3D ice_read_txq_ctx(hw, &tlan_ctx, tx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to read TXQ[%d] context, err=3D%d\n", + tx_ring->q_index, err); + return -EIO; + } + memcpy(&tlan_ctx_orig, &tlan_ctx, sizeof(tlan_ctx)); + tqctl =3D rd32(hw, QINT_TQCTL(tx_ring->reg_idx)); + if (tx_ring->q_vector) + dynctl =3D rd32(hw, GLINT_DYN_CTL(tx_ring->q_vector->reg_idx)); + + /* 3. Switch Tx queue context as PF space and PF DMA ring base. */ + tlan_ctx.vmvf_type =3D ICE_TLAN_CTX_VMVF_TYPE_PF; + tlan_ctx.vmvf_num =3D 0; + tlan_ctx.base =3D tx_desc_dma >> ICE_TLAN_CTX_BASE_S; + err =3D ice_write_txq_ctx(hw, &tlan_ctx, tx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to write TXQ[%d] context, err=3D%d\n", + tx_ring->q_index, err); + return -EIO; + } + + /* 4. Disable Tx queue interrupt. */ + wr32(hw, QINT_TQCTL(tx_ring->reg_idx), QINT_TQCTL_ITR_INDX_M); + + /* To disable Tx queue interrupt during run time, software should + * write mmio to trigger a MSIX interrupt. 
+ */ + if (tx_ring->q_vector) + wr32(hw, GLINT_DYN_CTL(tx_ring->q_vector->reg_idx), + (ICE_ITR_NONE << GLINT_DYN_CTL_ITR_INDX_S) | + GLINT_DYN_CTL_SWINT_TRIG_M | + GLINT_DYN_CTL_INTENA_M); + + /* Force memory writes to complete before letting h/w know there + * are new descriptors to fetch. + */ + wmb(); + + /* 5. Bump doorbell to advance Tx Queue head */ + writel(head, tx_ring->tail); + + /* 6. Wait until Tx Queue head move to expected place */ + err =3D ice_migration_wait_for_tx_completion(hw, tx_ring, head); + if (err) { + dev_err(dev, "VF %d txq[%d] head loading timeout\n", + vf->vf_id, tx_ring->q_index); + return err; + } + + /* 7. Overwrite Tx Queue context with backup context */ + err =3D ice_write_txq_ctx(hw, &tlan_ctx_orig, tx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to write TXQ[%d] context, err=3D%d\n", + tx_ring->q_index, err); + return -EIO; + } + + /* 8. Re-enable Tx queue interrupt */ + wr32(hw, QINT_TQCTL(tx_ring->reg_idx), tqctl); + if (tx_ring->q_vector) + wr32(hw, GLINT_DYN_CTL(tx_ring->q_vector->reg_idx), dynctl); + + return 0; +} + +/** + * ice_migration_load_tx_queue - Load Tx queue data from migration payload + * @vf: pointer to the VF being migrated to + * @vsi: the VSI for this VF + * @tx_queue: Tx queue data from migration payload + * @tx_desc: temporary descriptor for moving Tx head + * @tx_desc_dma: temporary descriptor DMA for moving Tx head + * @tx_pkt_dma: temporary packet DMA for moving Tx head + * + * Load the Tx queue information from the migration buffer into the target= VF. + * + * Return: 0 for success, negative for error + */ +static int ice_migration_load_tx_queue(struct ice_vf *vf, struct ice_vsi *= vsi, + const struct ice_mig_tx_queue *tx_queue, + struct ice_tx_desc *tx_desc, + dma_addr_t tx_desc_dma, + dma_addr_t tx_pkt_dma) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_q_vector *q_vector; + struct ice_tx_ring *tx_ring; + int err; + + lockdep_assert_held(&vf->cfg_lock); + + if (tx_queue->queue_id >=3D vsi->num_txq) { + dev_dbg(dev, "Got data for queue %d but the VF is only configured with %= d Tx queues\n", + tx_queue->queue_id, vsi->num_txq); + return -EINVAL; + } + + dev_dbg(dev, "Loading Tx VF queue %d (PF queue %d) on VF %d\n", + tx_queue->queue_id, vsi->txq_map[tx_queue->queue_id], + vf->vf_id); + + tx_ring =3D vsi->tx_rings[tx_queue->queue_id]; + + if (WARN_ON_ONCE(!tx_ring)) + return -EINVAL; + + tx_ring->dma =3D tx_queue->dma; + tx_ring->count =3D tx_queue->count; + + /* Disable any existing queue first */ + err =3D ice_vf_vsi_dis_single_txq(vf, vsi, tx_queue->queue_id); + if (err) { + dev_dbg(dev, "Failed to disable existing queue, err %d\n", + err); + return err; + } + + err =3D ice_vsi_cfg_single_txq(vsi, vsi->tx_rings, tx_queue->queue_id); + if (err) { + dev_dbg(dev, "Failed to configure Tx queue %u, err %d\n", + tx_queue->queue_id, err); + return err; + } + + if (tx_queue->head >=3D tx_ring->count) { + dev_err(dev, "VF %d: invalid tx ring length to load\n", + vf->vf_id); + return -EINVAL; + } + + /* After the initial reset and Tx queue re-programming, the Tx head + * and tail state will be zero. If the desired state for the head is + * non-zero, we need to inject some dummy packets into the queue to + * move the head of the ring to the desired value. 
+ */ + if (tx_queue->head) { + ice_migration_init_dummy_desc(tx_desc, ICE_MAX_NUM_DESC, + tx_pkt_dma); + err =3D ice_migration_inject_dummy_desc(vf, tx_ring, + tx_queue->head, + tx_desc_dma); + if (err) + return err; + } + + if (tx_queue->vector_valid) { + q_vector =3D vsi->q_vectors[tx_queue->vector_id]; + ice_cfg_txq_interrupt(vsi, tx_queue->queue_id, + q_vector->vf_reg_idx, + q_vector->tx.itr_idx); + } + + if (tx_queue->ena) { + ice_vf_ena_txq_interrupt(vsi, tx_queue->queue_id); + set_bit(tx_queue->queue_id, vf->txq_ena); + } + + return 0; +} + +/** + * ice_migration_load_rx_queue - Load Rx queue data from migration buffer + * @vf: pointer to the VF being migrated to + * @vsi: pointer to the VSI for the VF + * @rx_queue: pointer to Rx queue migration data + * + * Load the Rx queue data from the migration payload into the target VF. + * + * Return: 0 for success, negative for error + */ +static int ice_migration_load_rx_queue(struct ice_vf *vf, struct ice_vsi *= vsi, + const struct ice_mig_rx_queue *rx_queue) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_rlan_ctx rlan_ctx =3D {}; + struct ice_hw *hw =3D &vf->pf->hw; + struct ice_q_vector *q_vector; + struct ice_rx_ring *rx_ring; + int err; + + lockdep_assert_held(&vf->cfg_lock); + + if (rx_queue->queue_id >=3D vsi->num_rxq) { + dev_dbg(dev, "Got data for queue %d but the VF is only configured with %= d Rx queues\n", + rx_queue->queue_id, vsi->num_rxq); + return -EINVAL; + } + + dev_dbg(dev, "Loading Rx queue %d on VF %d\n", + rx_queue->queue_id, vf->vf_id); + + if (!(BIT(rx_queue->rxdid) & vf->pf->supported_rxdids)) { + dev_dbg(dev, "Got unsupported Rx descriptor ID %u\n", + rx_queue->rxdid); + return -EINVAL; + } + + rx_ring =3D vsi->rx_rings[rx_queue->queue_id]; + + if (WARN_ON_ONCE(!rx_ring)) + return -EINVAL; + + rx_ring->dma =3D rx_queue->dma; + rx_ring->count =3D rx_queue->count; + + if (rx_queue->crc_strip) + rx_ring->flags &=3D ~ICE_RX_FLAGS_CRC_STRIP_DIS; + else + rx_ring->flags |=3D ICE_RX_FLAGS_CRC_STRIP_DIS; + + rx_ring->rx_buf_len =3D rx_queue->rx_buf_len; + rx_ring->max_frame =3D rx_queue->max_frame; + + err =3D ice_vsi_cfg_single_rxq(vsi, rx_queue->queue_id); + if (err) { + dev_dbg(dev, "Failed to configure Rx queue %u for VF %u, err %d\n", + rx_queue->queue_id, vf->vf_id, err); + return err; + } + + ice_write_qrxflxp_cntxt(hw, rx_ring->reg_idx, + rx_queue->rxdid, 0x03, false); + + err =3D ice_read_rxq_ctx(hw, &rlan_ctx, rx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to read RXQ[%d] context, err=3D%d\n", + rx_ring->q_index, err); + return -EIO; + } + + rlan_ctx.head =3D rx_queue->head; + err =3D ice_write_rxq_ctx(hw, &rlan_ctx, rx_ring->reg_idx); + if (err) { + dev_err(dev, "Failed to set LAN RXQ[%d] context, err=3D%d\n", + rx_ring->q_index, err); + return -EIO; + } + + wr32(hw, QRX_TAIL(rx_ring->reg_idx), rx_queue->tail); + + if (rx_queue->vector_valid) { + q_vector =3D vsi->q_vectors[rx_queue->vector_id]; + ice_cfg_rxq_interrupt(vsi, rx_queue->queue_id, + q_vector->vf_reg_idx, + q_vector->rx.itr_idx); + } + + if (rx_queue->ena) { + err =3D ice_vsi_ctrl_one_rx_ring(vsi, true, rx_queue->queue_id, + true); + if (err) { + dev_err(dev, "Failed to enable Rx ring %d on VSI %d, err %d\n", + rx_queue->queue_id, vsi->vsi_num, err); + return -EIO; + } + + ice_vf_ena_rxq_interrupt(vsi, rx_queue->queue_id); + set_bit(rx_queue->queue_id, vf->rxq_ena); + } + + return 0; +} + +/** + * ice_migration_load_msix_regs - Load MSI-X vector registers + * @vf: pointer to the VF being migrated to + * @vsi: the VSI of the 
target VF + * @msix_regs: MSI-X register data from migration payload + * + * Load the MSI-X vector register data from the migration payload into the + * target VF. + * + * Return: 0 for success, negative for error + */ +static int +ice_migration_load_msix_regs(struct ice_vf *vf, struct ice_vsi *vsi, + const struct ice_mig_msix_regs *msix_regs) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_hw *hw =3D &vf->pf->hw; + u16 reg_idx; + int itr; + + lockdep_assert_held(&vf->cfg_lock); + + if (msix_regs->vector_id > vsi->num_q_vectors + ICE_NONQ_VECS_VF) { + dev_dbg(dev, "Got data for MSI-X vector %d, but the VF is only configure= d with %d vectors\n", + msix_regs->vector_id, + vsi->num_q_vectors + ICE_NONQ_VECS_VF); + return -EINVAL; + } + + dev_dbg(dev, "Loading MSI-X register configuration for VF %u\n", + vf->vf_id); + + if (msix_regs->vector_id < ICE_NONQ_VECS_VF) { + reg_idx =3D vf->first_vector_idx + msix_regs->vector_id; + } else { + struct ice_q_vector *q_vector; + int v_id; + + v_id =3D msix_regs->vector_id - ICE_NONQ_VECS_VF; + q_vector =3D vsi->q_vectors[v_id]; + reg_idx =3D q_vector->reg_idx; + + q_vector->tx.itr_idx =3D msix_regs->tx_itr_idx; + q_vector->rx.itr_idx =3D msix_regs->rx_itr_idx; + } + + wr32(hw, GLINT_DYN_CTL(reg_idx), msix_regs->int_dyn_ctl); + for (itr =3D 0; itr < ICE_MIG_VF_ITR_NUM; itr++) + wr32(hw, GLINT_ITR(itr, reg_idx), msix_regs->int_intr[itr]); + + return 0; +} + /** * ice_migration_load_devstate - Load device state into the target VF * @vf_dev: pointer to the VF PCI device @@ -714,9 +1411,12 @@ static int ice_migration_load_vf_info(struct ice_vf *= vf, struct ice_vsi *vsi, int ice_migration_load_devstate(struct pci_dev *vf_dev, const void *buf, size_t buf_sz) { + const size_t dma_size =3D ICE_MAX_NUM_DESC * sizeof(struct ice_tx_desc); struct ice_pf *pf =3D ice_vf_dev_to_pf(vf_dev); const struct ice_mig_vf_info *vf_info; const struct ice_migration_tlv *tlv; + dma_addr_t tx_desc_dma, tx_pkt_dma; + void *tx_desc, *tx_pkt; struct ice_vsi *vsi; struct device *dev; struct ice_vf *vf; @@ -743,6 +1443,15 @@ int ice_migration_load_devstate(struct pci_dev *vf_de= v, const void *buf, return -EINVAL; } =20 + /* Allocate DMA ring and descriptor by PF */ + tx_desc =3D dma_alloc_coherent(dev, dma_size, &tx_desc_dma, GFP_KERNEL); + if (!tx_desc) + return -ENOMEM; + + tx_pkt =3D dma_alloc_coherent(dev, SZ_4K, &tx_pkt_dma, GFP_KERNEL); + if (!tx_pkt) + goto err_free_tx_desc_dma; + mutex_lock(&vf->cfg_lock); =20 vsi =3D ice_get_vf_vsi(vf); @@ -763,6 +1472,7 @@ int ice_migration_load_devstate(struct pci_dev *vf_dev= , const void *buf, tlv =3D buf; =20 do { + const void *data =3D tlv->data; size_t tlv_size; =20 switch (tlv->type) { @@ -771,6 +1481,18 @@ int ice_migration_load_devstate(struct pci_dev *vf_de= v, const void *buf, case ICE_MIG_TLV_VF_INFO: /* These are already handled above */ break; + case ICE_MIG_TLV_TX_QUEUE: + err =3D ice_migration_load_tx_queue(vf, vsi, data, + tx_desc, + tx_desc_dma, + tx_pkt_dma); + break; + case ICE_MIG_TLV_RX_QUEUE: + err =3D ice_migration_load_rx_queue(vf, vsi, data); + break; + case ICE_MIG_TLV_MSIX_REGS: + err =3D ice_migration_load_msix_regs(vf, vsi, data); + break; default: dev_dbg(dev, "Unexpected TLV %d in payload?\n", tlv->type); @@ -791,12 +1513,19 @@ int ice_migration_load_devstate(struct pci_dev *vf_d= ev, const void *buf, =20 ice_put_vf(vf); =20 + dma_free_coherent(dev, SZ_4K, tx_pkt, tx_pkt_dma); + dma_free_coherent(dev, dma_size, tx_desc, tx_desc_dma); + return 0; =20 err_release_cfg_lock: 
mutex_unlock(&vf->cfg_lock); ice_put_vf(vf); =20 + dma_free_coherent(dev, SZ_4K, tx_pkt, tx_pkt_dma); +err_free_tx_desc_dma: + dma_free_coherent(dev, dma_size, tx_desc, tx_desc_dma); + return err; } EXPORT_SYMBOL(ice_migration_load_devstate); diff --git a/drivers/net/ethernet/intel/ice/virt/queues.c b/drivers/net/eth= ernet/intel/ice/virt/queues.c index 40575cfe6dd4..65ba0a1d8c1f 100644 --- a/drivers/net/ethernet/intel/ice/virt/queues.c +++ b/drivers/net/ethernet/intel/ice/virt/queues.c @@ -299,6 +299,27 @@ int ice_vc_ena_qs_msg(struct ice_vf *vf, u8 *msg) continue; =20 ice_vf_ena_txq_interrupt(vsi, vf_q_id); + + /* The Tx head register is a shadow copy of on-die Tx head + * which maintains the accurate location. The Tx head register + * is only updated after a packet is sent, and is updated to + * the value one behind the actual on-die Tx head value. + * + * Even after queue enable, until a packet is sent the Tx head + * remains whatever value it had before. + * + * QTX_COMM_HEAD.HEAD values from 0x1fe0 to 0x1fff are + * reserved and will never be used by HW. Manually write a + * reserved value into Tx head and use this as a marker to + * indicate that no packets have been sent since the queue was + * enabled. + * + * This marker is used by the live migration logic to + * accurately determine the Tx queue head. + */ + wr32(&vsi->back->hw, QTX_COMM_HEAD(vsi->txq_map[vf_q_id]), + QTX_COMM_HEAD_HEAD_M); + set_bit(vf_q_id, vf->txq_ena); } =20 --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 148A226A0A7; Tue, 9 Sep 2025 22:00:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455215; cv=none; b=A6QoGfMdtTYvqtqojOMijcSNw+WY9TyNkWxkcjTC3rRiIbNBfIxUXM1x1dNz4hr+oJoHAOkdsWTu6cQU1D6O/VxDu5OUVI7BafIKEFtXVEUeLGy9rMWG8mx+0q+bievXJIPAr3gLRUevyjk/0zVQAYAuurhDPmMTjMMU9mi2BZU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455215; c=relaxed/simple; bh=Zbi2e2VvKzf88ESYqAWc3a4461r6YnvPI+Q6t4jVxPE=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=QLqFU9VfiStdtEe4RPl9J4s0ne0x2WBJBWuNiiF13kzBQdmZ1Ml6sI0AmqDINLhlx/+ZgRq3rvW9PNdcnofYIBNlOcNlvj24YY6wP/n+1ymDQLbhhdAwfCCYpwRa4aEiAaP8s8/02IZEAsto2j9TwX8Jt/kWuOC9Jl2xk1ePV2w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=QH0BnNyy; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="QH0BnNyy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455213; x=1788991213; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=Zbi2e2VvKzf88ESYqAWc3a4461r6YnvPI+Q6t4jVxPE=; b=QH0BnNyyYZvshQlkp3uQGLebk6KSMUkvlcVmexwWeMje5NvKLBoP+GRD 
r+gcqsxstiQJwIZA43zAjoaciIAyXFjaC3ZPLcO02UpmeFpJnZ+Tli0WY P1EefJK/457DeZX49jwfvqdJ1YkH7e2qDTW08efuiz9W7zCZQz9qdkduV zqKEhKhzPc/8utAaRX8zhqLQHHtipQCL97p0FOi2J5kH/jqi2OfGtzVnh 1BaOUPw4XOw7lB555Nz043CKa4F9H0xcuS/VvOVxjv4wU+ILJG9287+4J QCr5YXeOducJr6yWbuk4+Mgoe/CJNGLXTjdFhKEDH1W+pSGlUl9CMzf+Z A==; X-CSE-ConnectionGUID: NCaQgkyeSKO3pn9+Y27ydQ== X-CSE-MsgGUID: 8FJkW35cQYOYMaC7a7JjNQ== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584668" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584668" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:09 -0700 X-CSE-ConnectionGUID: l72Vu2WKSk6+qgZjp+zxcg== X-CSE-MsgGUID: YAUTIRX/QWal3k5073zBYQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780968" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:54 -0700 Subject: [PATCH RFC net-next 5/7] ice: add remaining migration TLVs Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-5-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=28999; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=Zbi2e2VvKzf88ESYqAWc3a4461r6YnvPI+Q6t4jVxPE=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi1NP1jE/KOa6sScwfcpN0diXdbO1p8gpfu5dsf6Gt 2nJVNn0jlIWBjEuBlkxRRYFh5CV140nhGm9cZaDmcPKBDKEgYtTACYS/Y3hr7zcrz+OWjpKvU1R LML381/K82Vszb2i3rDaaM4a47OOGxn+KZf1pETGGBoGH/BfsPhTZgrDJDUrDysRBh717ognPsu 4AQ== X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 Add a handful of remaining TLVs to complete the migration payload data including: * ICE_MIG_TLV_MBX_REGS This TLV contains the VF mailbox registers data to migrate and restore the mailbox queue to its appropriate state. * ICE_MIG_TLV_STATS This TLV contains the VF statistics to ensure that the original and target VM maintain the same stat counts. * ICE_MIG_TLV_RSS This TLV contains the RSS information from the original host, ensuring that such configuration is applied on the new host. * ICE_MIG_TLV_VLAN_FILTERS This TLV contains all the VLAN filters currently programmed into hardware by the VF. It is sent as a single variable length flexible array instead of as individual TLVs per VLAN to avoid a 4-byte overhead per-VLAN. * ICE_MIG_TLV_MAC_FILTERS This TLV contains all of the MAC filters currently programmed into the hardware by the VF. As with VLANs, it is sent as a flexible array to avoid too much overhead when there are many filters. 
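As a sketch of how a receiver might validate one of these variable length
filter TLVs before trusting its contents, consider the VLAN filter list.
The struct layout below is taken from this patch; the 4-byte TLV header
size (implied by the per-VLAN overhead noted above) and the helper itself
are assumptions for illustration only, not part of the series.

  /* Illustration: verify that num_vlans actually fits within the
   * received TLV payload before walking the flexible array.
   */
  #include <errno.h>
  #include <stddef.h>
  #include <stdint.h>

  struct ice_mig_vlan_filter {
          uint16_t tpid;
          uint16_t vid;
  };

  struct ice_mig_vlan_filters {
          uint16_t num_vlans;
          struct ice_mig_vlan_filter vlans[];
  };

  static int ice_mig_check_vlan_tlv(const struct ice_mig_vlan_filters *vlans,
                                    size_t payload_len)
  {
          size_t needed;

          if (payload_len < sizeof(*vlans))
                  return -EINVAL;

          needed = sizeof(*vlans) +
                   (size_t)vlans->num_vlans * sizeof(vlans->vlans[0]);

          return payload_len >= needed ? 0 : -EINVAL;
  }

Packed this way, N VLAN filters cost one TLV header plus a 2-byte count
plus N * 4 bytes of entries, rather than an extra 4-byte header for every
single VLAN, which is the overhead the flexible array layout avoids.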
Add functions to save and restore this data appropriately during the live migration process. Signed-off-by: Jacob Keller Reviewed-by: Aleksandr Loktionov --- drivers/net/ethernet/intel/ice/ice_hw_autogen.h | 8 + .../net/ethernet/intel/ice/virt/migration_tlv.h | 133 +++++ drivers/net/ethernet/intel/ice/virt/migration.c | 616 +++++++++++++++++= ++++ 3 files changed, 757 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h b/drivers/net/= ethernet/intel/ice/ice_hw_autogen.h index dd520aa4d1d6..954d671aee64 100644 --- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h +++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h @@ -39,8 +39,16 @@ #define PF_FW_ATQLEN_ATQVFE_M BIT(28) #define PF_FW_ATQLEN_ATQOVFL_M BIT(29) #define PF_FW_ATQLEN_ATQCRIT_M BIT(30) +#define VF_MBX_ARQBAH(_VF) (0x0022B800 + ((_VF) * 4)) +#define VF_MBX_ARQBAL(_VF) (0x0022B400 + ((_VF) * 4)) +#define VF_MBX_ARQH(_VF) (0x0022C000 + ((_VF) * 4)) #define VF_MBX_ARQLEN(_VF) (0x0022BC00 + ((_VF) * 4)) +#define VF_MBX_ARQT(_VF) (0x0022C400 + ((_VF) * 4)) +#define VF_MBX_ATQBAH(_VF) (0x0022A400 + ((_VF) * 4)) +#define VF_MBX_ATQBAL(_VF) (0x0022A000 + ((_VF) * 4)) +#define VF_MBX_ATQH(_VF) (0x0022AC00 + ((_VF) * 4)) #define VF_MBX_ATQLEN(_VF) (0x0022A800 + ((_VF) * 4)) +#define VF_MBX_ATQT(_VF) (0x0022B000 + ((_VF) * 4)) #define PF_FW_ATQLEN_ATQENABLE_M BIT(31) #define PF_FW_ATQT 0x00080400 #define PF_MBX_ARQBAH 0x0022E400 diff --git a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h b/drivers/= net/ethernet/intel/ice/virt/migration_tlv.h index 3e10e53868b2..183555cda9b3 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration_tlv.h +++ b/drivers/net/ethernet/intel/ice/virt/migration_tlv.h @@ -82,6 +82,16 @@ struct ice_migration_tlv { * @ICE_MIG_TLV_MSIX_REGS: MSI-X register data for the VF. Appears once per * MSI-X interrupt, including the miscellaneous interrupt for the mailbox. * + * @ICE_MIG_TLV_MBX_REGS: Mailbox register data for the VF. + * + * @ICE_MIG_TLV_STATS: Current statistics counts of the VF. + * + * @ICE_MIG_TLV_RSS: RSS configuration for the VF. + * + * @ICE_MIG_TLV_VLAN_FILTERS: VLAN filter information. + * + * @ICE_MIG_TLV_MAC_FILTERS: MAC filter information. + * * @NUM_ICE_MIG_TLV: Number of known TLV types. Any type equal to or larger * than this value is unrecognized by this version. 
* @@ -97,6 +107,11 @@ enum ice_migration_tlvs { ICE_MIG_TLV_TX_QUEUE, ICE_MIG_TLV_RX_QUEUE, ICE_MIG_TLV_MSIX_REGS, + ICE_MIG_TLV_MBX_REGS, + ICE_MIG_TLV_STATS, + ICE_MIG_TLV_RSS, + ICE_MIG_TLV_VLAN_FILTERS, + ICE_MIG_TLV_MAC_FILTERS, =20 /* Add new types above here */ NUM_ICE_MIG_TLV @@ -258,6 +273,119 @@ struct ice_mig_msix_regs { u16 vector_id; } __packed; =20 +/** + * struct ice_mig_stats - Hardware statistics counts from migrating VF + * @rx_bytes: total nr received bytes (gorc) + * @rx_unicast: total nr received unicast packets (uprc) + * @rx_multicast: total nr received multicast packets (mprc) + * @rx_broadcast: total nr received broadcast packets (bprc) + * @rx_discards: total nr packets discarded on receipt (rdpc) + * @rx_unknown_protocol: total nr Rx packets with unrecognized protocol (r= upp) + * @tx_bytes: total nr transmitted bytes (gotc) + * @tx_unicast: total nr transmitted unicast packets (uptc) + * @tx_multicast: total nr transmitted multicast packets (mptc) + * @tx_broadcast: total nr transmitted broadcast packets (bptc) + * @tx_discards: total nr packets discarded on transmit (tdpc) + * @tx_errors: total number of transmit errors (tepc) + */ +struct ice_mig_stats { + u64 rx_bytes; + u64 rx_unicast; + u64 rx_multicast; + u64 rx_broadcast; + u64 rx_discards; + u64 rx_unknown_protocol; + u64 tx_bytes; + u64 tx_unicast; + u64 tx_multicast; + u64 tx_broadcast; + u64 tx_discards; + u64 tx_errors; +} __packed; + +/** + * struct ice_mig_mbx_regs - PF<->VF Mailbox register data for the VF + * @atq_head: the head position of the VF AdminQ Tx ring + * @atq_tail: the tail position of the VF AdminQ Tx ring + * @atq_bal: lower 32-bits of the VF AdminQ Tx ring base address + * @atq_bah: upper 32-bits of the VF AdminQ Tx ring base address + * @atq_len: length of the VF AdminQ Tx ring + * @arq_head: the head position of the VF AdminQ Rx ring + * @arq_tail: the tail position of the VF AdminQ Tx ring + * @arq_bal: lower 32-bits of the VF AdminQ Rx ring base address + * @arq_bah: upper 32-bits of the VF AdminQ Rx ring base address + * @arq_len: length of the VF AdminQ Rx ring + */ +struct ice_mig_mbx_regs { + u32 atq_head; + u32 atq_tail; + u32 atq_bal; + u32 atq_bah; + u32 atq_len; + u32 arq_head; + u32 arq_tail; + u32 arq_bal; + u32 arq_bah; + u32 arq_len; +} __packed; + +/** + * struct ice_mig_rss - RSS configuration for the migrating VF + * @hashcfg: RSS Hash filter configuration + * @key: RSS key + * @lut_size: size of the RSS lookup table + * @hfunc: RSS hash function selected + * @lut: RSS lookup table configuration + */ +struct ice_mig_rss { + u64 hashcfg; + /* TODO: Can this key change size? Should this be a plain buffer + * instead of the struct? 
+ */ + struct ice_aqc_get_set_rss_keys key; + u16 lut_size; + u8 hfunc; + u8 lut[] __counted_by(lut_size); +} __packed; + +/** + * struct ice_mig_vlan_filter - VLAN filter information + * @tpid: VLAN TPID + * @vid: VLAN ID + */ +struct ice_mig_vlan_filter { + u16 tpid; + u16 vid; +} __packed; + +/** + * struct ice_mig_vlan_filters - List of VLAN filters for the VF + * @num_vlans: number of VLANs for this VF + * @vlans: VLAN data + */ +struct ice_mig_vlan_filters { + u16 num_vlans; + struct ice_mig_vlan_filter vlans[] __counted_by(num_vlans); +} __packed; + +/** + * struct ice_mig_mac_filter - MAC address data for a VF filter + * @mac_addr: the MAC address + */ +struct ice_mig_mac_filter { + u8 mac_addr[ETH_ALEN]; +} __packed; + +/** + * struct ice_mig_mac_filters - List of MAC filters for the VF + * @num_macs: number of MAC filters for this VF + * @macs: MAC address data + */ +struct ice_mig_mac_filters { + u16 num_macs; + struct ice_mig_mac_filter macs[] __counted_by(num_macs); +} __packed; + /** * ice_mig_tlv_type - Convert a TLV type to its number * @p: the TLV structure type @@ -273,6 +401,11 @@ struct ice_mig_msix_regs { struct ice_mig_tx_queue : ICE_MIG_TLV_TX_QUEUE, \ struct ice_mig_rx_queue : ICE_MIG_TLV_RX_QUEUE, \ struct ice_mig_msix_regs : ICE_MIG_TLV_MSIX_REGS, \ + struct ice_mig_mbx_regs : ICE_MIG_TLV_MBX_REGS, \ + struct ice_mig_stats : ICE_MIG_TLV_STATS, \ + struct ice_mig_rss : ICE_MIG_TLV_RSS, \ + struct ice_mig_vlan_filters : ICE_MIG_TLV_VLAN_FILTERS,\ + struct ice_mig_mac_filters : ICE_MIG_TLV_MAC_FILTERS, \ default : ICE_MIG_TLV_END) =20 /** diff --git a/drivers/net/ethernet/intel/ice/virt/migration.c b/drivers/net/= ethernet/intel/ice/virt/migration.c index e59eb99b20da..a9f6d3019c0c 100644 --- a/drivers/net/ethernet/intel/ice/virt/migration.c +++ b/drivers/net/ethernet/intel/ice/virt/migration.c @@ -401,6 +401,320 @@ static int ice_migration_save_msix_regs(struct ice_vf= *vf, return err; } =20 +/** + * ice_migration_save_mbx_regs - Save Mailbox registers + * @vf: pointer to the VF being migrated + * + * Save the mailbox registers for communicating with VF in preparation for + * live migration. + * + * Return: 0 for success, negative for error + */ +static int ice_migration_save_mbx_regs(struct ice_vf *vf) +{ + struct ice_mig_mbx_regs *mbx_regs; + struct ice_hw *hw =3D &vf->pf->hw; + + lockdep_assert_held(&vf->cfg_lock); + + mbx_regs =3D ice_mig_alloc_tlv(mbx_regs); + if (!mbx_regs) + return -ENOMEM; + + mbx_regs->atq_head =3D rd32(hw, VF_MBX_ATQH(vf->vf_id)); + mbx_regs->atq_tail =3D rd32(hw, VF_MBX_ATQT(vf->vf_id)); + mbx_regs->atq_bal =3D rd32(hw, VF_MBX_ATQBAL(vf->vf_id)); + mbx_regs->atq_bah =3D rd32(hw, VF_MBX_ATQBAH(vf->vf_id)); + mbx_regs->atq_len =3D rd32(hw, VF_MBX_ATQLEN(vf->vf_id)); + + mbx_regs->arq_head =3D rd32(hw, VF_MBX_ARQH(vf->vf_id)); + mbx_regs->arq_tail =3D rd32(hw, VF_MBX_ARQT(vf->vf_id)); + mbx_regs->arq_bal =3D rd32(hw, VF_MBX_ARQBAL(vf->vf_id)); + mbx_regs->arq_bah =3D rd32(hw, VF_MBX_ARQBAH(vf->vf_id)); + mbx_regs->arq_len =3D rd32(hw, VF_MBX_ARQLEN(vf->vf_id)); + + ice_mig_tlv_add_tail(mbx_regs, &vf->mig_tlvs); + + return 0; +} + +/** + * ice_migration_save_stats - Save VF statistics counters + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Update and save the current statistics values for the VF. 
+ * + * Return: 0 for success, negative for error + */ +static int ice_migration_save_stats(struct ice_vf *vf, struct ice_vsi *vsi) +{ + struct ice_mig_stats *stats; + + lockdep_assert_held(&vf->cfg_lock); + + stats =3D ice_mig_alloc_tlv(stats); + if (!stats) + return -ENOMEM; + + ice_update_eth_stats(vsi); + + stats->rx_bytes =3D vsi->eth_stats.rx_bytes; + stats->rx_unicast =3D vsi->eth_stats.rx_unicast; + stats->rx_multicast =3D vsi->eth_stats.rx_multicast; + stats->rx_broadcast =3D vsi->eth_stats.rx_broadcast; + stats->rx_discards =3D vsi->eth_stats.rx_discards; + stats->rx_unknown_protocol =3D vsi->eth_stats.rx_unknown_protocol; + stats->tx_bytes =3D vsi->eth_stats.tx_bytes; + stats->tx_unicast =3D vsi->eth_stats.tx_unicast; + stats->tx_multicast =3D vsi->eth_stats.tx_multicast; + stats->tx_broadcast =3D vsi->eth_stats.tx_broadcast; + stats->tx_discards =3D vsi->eth_stats.tx_discards; + stats->tx_errors =3D vsi->eth_stats.tx_errors; + + ice_mig_tlv_add_tail(stats, &vf->mig_tlvs); + + return 0; +} + +/** + * ice_migration_save_rss - Save RSS configuration during suspend + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save the RSS configuration for this VF, including the hash function, ha= sh + * set configuration, lookup table, and RSS key. + * + * Return: 0 on success, or an error code on failure. + */ +static int ice_migration_save_rss(struct ice_vf *vf, struct ice_vsi *vsi) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_hw *hw =3D &vf->pf->hw; + struct ice_mig_rss *rss; + + lockdep_assert_held(&vf->cfg_lock); + + /* Skip RSS if its not supported by this PF */ + if (!test_bit(ICE_FLAG_RSS_ENA, vf->pf->flags)) { + dev_dbg(dev, "RSS is not supported by the PF\n"); + return 0; + } + + dev_dbg(dev, "Saving RSS config for VF %u\n", + vf->vf_id); + + /* When ice PF supports variable RSS LUT sizes, this will need to be + * updated. For now, the PF enforces a strict table size of + * ICE_LUT_VSI_SIZE. + */ + rss =3D ice_mig_alloc_flex_tlv(rss, lut, ICE_LUT_VSI_SIZE); + if (!rss) + return -ENOMEM; + + rss->hashcfg =3D vf->rss_hashcfg; + rss->hfunc =3D vsi->rss_hfunc; + rss->lut_size =3D ICE_LUT_VSI_SIZE; + ice_aq_get_rss_key(hw, vsi->idx, &rss->key); + ice_get_rss_lut(vsi, rss->lut, ICE_LUT_VSI_SIZE); + + ice_mig_tlv_add_tail(rss, &vf->mig_tlvs); + + return 0; +} + +/** + * ice_migration_save_vlan_filters - Save all VLAN filters used by VF + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save the VLAN filters configured for the VF when suspending it for live + * migration. + * + * Return: 0 on success, negative error code on failure. 
+ */ +static int ice_migration_save_vlan_filters(struct ice_vf *vf, + struct ice_vsi *vsi) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_fltr_mgmt_list_entry *fm_entry; + struct ice_mig_vlan_filters *vlan_filters; + struct ice_hw *hw =3D &vf->pf->hw; + struct list_head *rule_head; + struct ice_switch_info *sw; + int vlan_idx; + + lockdep_assert_held(&vf->cfg_lock); + + if (!vsi->num_vlan) + return 0; + + dev_dbg(dev, "Saving %u VLANs for VF %d\n", + vsi->num_vlan, vf->vf_id); + + /* Ensure variable size TLV is aligned to 4 bytes */ + vlan_filters =3D ice_mig_alloc_flex_tlv(vlan_filters, vlans, + vsi->num_vlan); + if (!vlan_filters) + return -ENOMEM; + + vlan_filters->num_vlans =3D vsi->num_vlan; + + sw =3D hw->switch_info; + rule_head =3D &sw->recp_list[ICE_SW_LKUP_VLAN].filt_rules; + + mutex_lock(&sw->recp_list[ICE_SW_LKUP_VLAN].filt_rule_lock); + + list_for_each_entry(fm_entry, rule_head, list_entry) { + struct ice_mig_vlan_filter *vlan; + + /* ignore anything that isn't a VLAN VSI filter */ + if (fm_entry->fltr_info.lkup_type !=3D ICE_SW_LKUP_VLAN || + (fm_entry->fltr_info.fltr_act !=3D ICE_FWD_TO_VSI && + fm_entry->fltr_info.fltr_act !=3D ICE_FWD_TO_VSI_LIST)) + continue; + + if (fm_entry->vsi_count < 2 && !fm_entry->vsi_list_info && + fm_entry->fltr_info.fltr_act =3D=3D ICE_FWD_TO_VSI) { + /* Check if ICE_FWD_TO_VSI matches this VSI */ + if (fm_entry->fltr_info.vsi_handle !=3D vsi->idx) + continue; + } else if (fm_entry->vsi_list_info && + fm_entry->fltr_info.fltr_act =3D=3D ICE_FWD_TO_VSI_LIST) { + /* Check if ICE_FWD_TO_VSI_LIST matches this VSI */ + if (!test_bit(vsi->idx, + fm_entry->vsi_list_info->vsi_map)) + continue; + } else { + dev_dbg(dev, "Ignoring malformed filter entry that doesn't look like ei= ther a VSI or VSI list filter.\n"); + continue; + } + + /* We shouldn't hit this, assuming num_vlan is consistent with + * the actual number of entries in the table. + */ + if (vlan_idx >=3D vsi->num_vlan) { + dev_warn(dev, "VF VSI claims to have %d VLAN filters but we found more = than that in the switch table. Some filters might be lost in migration\n", + vsi->num_vlan); + break; + } + + vlan =3D &vlan_filters->vlans[vlan_idx]; + vlan->vid =3D fm_entry->fltr_info.l_data.vlan.vlan_id; + if (fm_entry->fltr_info.l_data.vlan.tpid_valid) + vlan->tpid =3D fm_entry->fltr_info.l_data.vlan.tpid; + else + vlan->tpid =3D ETH_P_8021Q; + + vlan_idx++; + } + + if (vlan_idx !=3D vsi->num_vlan) { + dev_warn(dev, "VSI had %u VLANs, but we only found %u VLANs\n", + vsi->num_vlan, vlan_idx); + vlan_filters->num_vlans =3D vlan_idx; + } + + ice_mig_tlv_add_tail(vlan_filters, &vf->mig_tlvs); + + mutex_unlock(&sw->recp_list[ICE_SW_LKUP_VLAN].filt_rule_lock); + + return 0; +} + +/** + * ice_migration_save_mac_filters - Save MAC filters used by VF + * @vf: pointer to the VF being migrated + * @vsi: the VSI for this VF + * + * Save the MAC filters configured for the VF when suspending it for live + * migration. + * + * Return: 0 on success, negative error code on failure. 
+ */ +static int ice_migration_save_mac_filters(struct ice_vf *vf, + struct ice_vsi *vsi) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_fltr_mgmt_list_entry *fm_entry; + struct ice_mig_mac_filters *mac_filters; + struct ice_hw *hw =3D &vf->pf->hw; + struct list_head *rule_head; + struct ice_switch_info *sw; + int mac_idx; + + lockdep_assert_held(&vf->cfg_lock); + + if (!vf->num_mac) + return 0; + + dev_dbg(dev, "Saving %u MAC filters for VF %u\n", + vf->num_mac, vf->vf_id); + + /* Ensure variable size TLV is aligned to 4 bytes */ + mac_filters =3D ice_mig_alloc_flex_tlv(mac_filters, macs, + vf->num_mac); + if (!mac_filters) + return -ENOMEM; + + mac_filters->num_macs =3D vf->num_mac; + + sw =3D hw->switch_info; + rule_head =3D &sw->recp_list[ICE_SW_LKUP_MAC].filt_rules; + + mutex_lock(&sw->recp_list[ICE_SW_LKUP_MAC].filt_rule_lock); + + mac_idx =3D 0; + list_for_each_entry(fm_entry, rule_head, list_entry) { + /* ignore anything that isn't a MAC VSI filter */ + if (fm_entry->fltr_info.lkup_type !=3D ICE_SW_LKUP_MAC || + (fm_entry->fltr_info.fltr_act !=3D ICE_FWD_TO_VSI && + fm_entry->fltr_info.fltr_act !=3D ICE_FWD_TO_VSI_LIST)) + continue; + + if (fm_entry->vsi_count < 2 && !fm_entry->vsi_list_info && + fm_entry->fltr_info.fltr_act =3D=3D ICE_FWD_TO_VSI) { + /* Check if ICE_FWD_TO_VSI matches this VSI */ + if (fm_entry->fltr_info.vsi_handle !=3D vsi->idx) + continue; + } else if (fm_entry->vsi_list_info && + fm_entry->fltr_info.fltr_act =3D=3D ICE_FWD_TO_VSI_LIST) { + /* Check if ICE_FWD_TO_VSI_LIST matches this VSI */ + if (!test_bit(vsi->idx, + fm_entry->vsi_list_info->vsi_map)) + continue; + } else { + dev_dbg(dev, "Ignoring malformed filter entry that doesn't look like ei= ther a VSI or VSI list filter.\n"); + continue; + } + + /* We shouldn't hit this, assuming num_mac is consistent with + * the actual number of entries in the table. + */ + if (mac_idx >=3D vf->num_mac) { + dev_warn(dev, "VF claims to have %d MAC filters but we found more than = that in the switch table. 
Some filters might be lost in migration\n", + vf->num_mac); + break; + } + + ether_addr_copy(mac_filters->macs[mac_idx].mac_addr, + fm_entry->fltr_info.l_data.mac.mac_addr); + mac_idx++; + } + + if (mac_idx !=3D vf->num_mac) { + dev_warn(dev, "VF VSI had %u MAC filters, but we only found %u MAC filte= rs\n", + vf->num_mac, mac_idx); + mac_filters->num_macs =3D mac_idx; + } + + ice_mig_tlv_add_tail(mac_filters, &vf->mig_tlvs); + + mutex_unlock(&sw->recp_list[ICE_SW_LKUP_MAC].filt_rule_lock); + + return 0; +} + /** * ice_migration_suspend_dev - suspend device * @vf_dev: pointer to the VF PCI device @@ -463,6 +777,17 @@ int ice_migration_suspend_dev(struct pci_dev *vf_dev, = bool save_state) if (err) goto err_free_mig_tlvs; =20 + err =3D ice_migration_save_rss(vf, vsi); + if (err) + goto err_free_mig_tlvs; + + err =3D ice_migration_save_vlan_filters(vf, vsi); + if (err) + goto err_free_mig_tlvs; + + err =3D ice_migration_save_mac_filters(vf, vsi); + if (err) + goto err_free_mig_tlvs; } =20 /* Prevent VSI from queuing incoming packets by removing all filters */ @@ -498,6 +823,16 @@ int ice_migration_suspend_dev(struct pci_dev *vf_dev, = bool save_state) err =3D ice_migration_save_tx_queues(vf, vsi); if (err) goto err_free_mig_tlvs; + + /* Save mailbox registers */ + err =3D ice_migration_save_mbx_regs(vf); + if (err) + goto err_free_mig_tlvs; + + /* Save current VF statistics */ + err =3D ice_migration_save_stats(vf, vsi); + if (err) + goto err_free_mig_tlvs; } =20 mutex_unlock(&vf->cfg_lock); @@ -1397,6 +1732,272 @@ ice_migration_load_msix_regs(struct ice_vf *vf, str= uct ice_vsi *vsi, return 0; } =20 +/** + * ice_migration_load_mbx_regs - Load mailbox registers from migration pay= load + * @vf: pointer to the VF being migrated to + * @mbx_regs: the mailbox register data from migration payload + * + * Load the mailbox register configuration from the migration payload and + * initialize the target VF. + */ +static void ice_migration_load_mbx_regs(struct ice_vf *vf, + const struct ice_mig_mbx_regs *mbx_regs) +{ + struct ice_hw *hw =3D &vf->pf->hw; + + lockdep_assert_held(&vf->cfg_lock); + + wr32(hw, VF_MBX_ATQH(vf->vf_id), mbx_regs->atq_head); + wr32(hw, VF_MBX_ATQT(vf->vf_id), mbx_regs->atq_tail); + wr32(hw, VF_MBX_ATQBAL(vf->vf_id), mbx_regs->atq_bal); + wr32(hw, VF_MBX_ATQBAH(vf->vf_id), mbx_regs->atq_bah); + wr32(hw, VF_MBX_ATQLEN(vf->vf_id), mbx_regs->atq_len); + + wr32(hw, VF_MBX_ARQH(vf->vf_id), mbx_regs->arq_head); + wr32(hw, VF_MBX_ARQT(vf->vf_id), mbx_regs->arq_tail); + wr32(hw, VF_MBX_ARQBAL(vf->vf_id), mbx_regs->arq_bal); + wr32(hw, VF_MBX_ARQBAH(vf->vf_id), mbx_regs->arq_bah); + wr32(hw, VF_MBX_ARQLEN(vf->vf_id), mbx_regs->arq_len); +} + +/** + * ice_migration_load_stats - Load VF statistics from migration buffer + * @vf: pointer to the VF being migrated to + * @vsi: the VSI for this VF + * @stats: the statistics values from the migration buffer. + * + * Load the VF statistics from the migration buffer, and re-initialize HW + * stats offsets to match. 
+ */ +static void ice_migration_load_stats(struct ice_vf *vf, struct ice_vsi *vs= i, + const struct ice_mig_stats *stats) +{ + lockdep_assert_held(&vf->cfg_lock); + + vsi->eth_stats.rx_bytes =3D stats->rx_bytes; + vsi->eth_stats.rx_unicast =3D stats->rx_unicast; + vsi->eth_stats.rx_multicast =3D stats->rx_multicast; + vsi->eth_stats.rx_broadcast =3D stats->rx_broadcast; + vsi->eth_stats.rx_discards =3D stats->rx_discards; + vsi->eth_stats.rx_unknown_protocol =3D stats->rx_unknown_protocol; + vsi->eth_stats.tx_bytes =3D stats->tx_bytes; + vsi->eth_stats.tx_unicast =3D stats->tx_unicast; + vsi->eth_stats.tx_multicast =3D stats->tx_multicast; + vsi->eth_stats.tx_broadcast =3D stats->tx_broadcast; + vsi->eth_stats.tx_discards =3D stats->tx_discards; + vsi->eth_stats.tx_errors =3D stats->tx_errors; + + /* Force the stats offsets to reload so that reported statistics + * exactly match the values from the migration buffer. + */ + vsi->stat_offsets_loaded =3D false; + ice_update_eth_stats(vsi); +} + +/** + * ice_migration_load_rss - Load VF RSS configuration from migration buffer + * @vf: pointer to the VF being migrated to + * @vsi: the VSI for this VF + * @rss: the RSS configuration from the migration buffer + * + * Load the VF RSS configuration from the migration buffer, and configure = the + * target VF to match. + * + * Return: 0 on success, or a negative error code on failure. + */ +static int ice_migration_load_rss(struct ice_vf *vf, struct ice_vsi *vsi, + const struct ice_mig_rss *rss) +{ + struct device *dev =3D ice_pf_to_dev(vf->pf); + struct ice_hw *hw =3D &vf->pf->hw; + int err; + + if (!test_bit(ICE_FLAG_RSS_ENA, vf->pf->flags)) { + dev_err(dev, "RSS is not supported by the PF\n"); + return -EOPNOTSUPP; + } + + dev_dbg(dev, "Loading RSS configuration for VF %u\n", vf->vf_id); + + err =3D ice_set_rss_key(vsi, (u8 *)&rss->key); + if (err) { + dev_dbg(dev, "Failed to set RSS key for VF %d, err %d\n", + vf->vf_id, err); + return err; + } + + err =3D ice_set_rss_lut(vsi, (u8 *)rss->lut, rss->lut_size); + if (err) { + dev_dbg(dev, "Failed to set RSS lookup table for VF %d, err %d\n", + vf->vf_id, err); + return err; + } + + err =3D ice_set_rss_hfunc(vsi, rss->hfunc); + if (err) { + dev_dbg(dev, "Failed to set RSS hash function for VF %d, err %d\n", + vf->vf_id, err); + return err; + } + + err =3D ice_rem_vsi_rss_cfg(hw, vsi->idx); + if (err && !rss->hashcfg) { + /* only report failure to clear the current RSS configuration + * if that was clearly the migrated VF's intention. + */ + dev_dbg(dev, "Failed to clear RSS hash configuration for VF %d, err %d\n= ", + vf->vf_id, err); + return err; + } + + if (!rss->hashcfg) + return 0; + + err =3D ice_add_avf_rss_cfg(hw, vsi, rss->hashcfg); + if (err) { + dev_dbg(dev, "Failed to set RSS hash configuration for VF %d, err %d\n", + vf->vf_id, err); + return err; + } + + return 0; +} + +/** + * ice_migration_load_vlan_filters - Load VLAN filters from migration buff= er + * @vf: pointer to the VF being migrated to + * @vsi: the VSI for this VF + * @vlan_filters: VLAN filters from the migration payload + * + * Load the VLAN filters from the migration payload and program the target= VF + * to match. + * + * Return: 0 on success, or a negative error code on failure. 
+ */
+static int
+ice_migration_load_vlan_filters(struct ice_vf *vf, struct ice_vsi *vsi,
+				const struct ice_mig_vlan_filters *vlan_filters)
+{
+	struct device *dev =3D ice_pf_to_dev(vf->pf);
+	struct ice_vsi_vlan_ops *vlan_ops;
+	struct ice_hw *hw =3D &vf->pf->hw;
+	int err;
+
+	dev_dbg(dev, "Loading %u VLANs for VF %d\n",
+		vlan_filters->num_vlans, vf->vf_id);
+
+	for (int idx =3D 0; idx < vlan_filters->num_vlans; idx++) {
+		const struct ice_mig_vlan_filter *entry;
+		struct ice_vlan vlan;
+
+		entry =3D &vlan_filters->vlans[idx];
+		vlan =3D ICE_VLAN(entry->tpid, entry->vid, 0);
+
+		/* ice_vsi_add_vlan converts -EEXIST errors from
+		 * ice_fltr_add_vlan() into a successful return.
+		 */
+		err =3D ice_vsi_add_vlan(vsi, &vlan);
+		if (err) {
+			dev_dbg(dev, "Failed to add VLAN %d for VF %d, err %d\n",
+				entry->vid, vf->vf_id, err);
+			return err;
+		}
+
+		/* We're re-adding the hardware vlan filters. The VF can
+		 * either add outer VLANs (in DVM), or inner VLANs (in
+		 * SVM). In SVM, we only enable promiscuous if the port VLAN
+		 * is not set.
+		 */
+		if (ice_is_vlan_promisc_allowed(vf) &&
+		    (ice_is_dvm_ena(hw) || !ice_vf_is_port_vlan_ena(vf))) {
+			err =3D ice_vf_ena_vlan_promisc(vf, vsi, &vlan);
+			if (err) {
+				dev_dbg(dev, "Failed to enable promiscuous filter on VLAN %d for VF %d, err %d\n",
+					entry->vid, vf->vf_id, err);
+				return err;
+			}
+		}
+	}
+
+	vlan_ops =3D ice_get_compat_vsi_vlan_ops(vsi);
+
+	if (ice_vsi_has_non_zero_vlans(vsi)) {
+		err =3D vlan_ops->ena_rx_filtering(vsi);
+		if (err) {
+			dev_dbg(dev, "Failed to enable VLAN pruning, err %d\n",
+				err);
+			return err;
+		}
+
+		if (vf->spoofchk) {
+			err =3D vlan_ops->ena_tx_filtering(vsi);
+			if (err) {
+				dev_dbg(dev, "Failed to enable VLAN anti-spoofing, err %d\n",
+					err);
+				return err;
+			}
+		}
+	} else {
+		/* Disable VLAN filtering when only VLAN 0 is left */
+		vlan_ops->dis_tx_filtering(vsi);
+		vlan_ops->dis_rx_filtering(vsi);
+	}
+
+	if (vsi->num_vlan !=3D vlan_filters->num_vlans)
+		dev_dbg(dev, "VF %d has %d VLAN filters, but we expected to have %d\n",
+			vf->vf_id, vsi->num_vlan, vlan_filters->num_vlans);
+
+	return 0;
+}
+
+/**
+ * ice_migration_load_mac_filters - Load MAC filters from migration buffer
+ * @vf: pointer to the VF being migrated to
+ * @vsi: the VSI for this VF
+ * @mac_filters: MAC address filters from the migration payload
+ *
+ * Load the MAC filters from the migration payload and program them into the
+ * target VF.
+ *
+ * Return: 0 on success, or a negative error code on failure.
+ */
+static int
+ice_migration_load_mac_filters(struct ice_vf *vf, struct ice_vsi *vsi,
+			       const struct ice_mig_mac_filters *mac_filters)
+{
+	struct device *dev =3D ice_pf_to_dev(vf->pf);
+
+	dev_dbg(dev, "Loading %u MAC filters for VF %d\n",
+		mac_filters->num_macs, vf->vf_id);
+
+	for (int idx =3D 0; idx < mac_filters->num_macs; idx++) {
+		const struct ice_mig_mac_filter *entry;
+		int err;
+
+		entry =3D &mac_filters->macs[idx];
+
+		err =3D ice_fltr_add_mac(vsi, entry->mac_addr, ICE_FWD_TO_VSI);
+		if (!err) {
+			vf->num_mac++;
+		} else if (err =3D=3D -EEXIST) {
+			/* Ignore duplicate filters, since initial filters may
+			 * already exist due to the resetting when loading the
+			 * VF information TLV.
+ */ + } else { + dev_dbg(dev, "Failed to add MAC %pM for VF %d, err %d\n", + entry->mac_addr, vf->vf_id, err); + return err; + } + } + + if (vf->num_mac !=3D mac_filters->num_macs) + dev_dbg(dev, "VF %d has %d MAC filters, but we expected to have %d\n", + vf->vf_id, vf->num_mac, mac_filters->num_macs); + + return 0; +} + /** * ice_migration_load_devstate - Load device state into the target VF * @vf_dev: pointer to the VF PCI device @@ -1493,6 +2094,21 @@ int ice_migration_load_devstate(struct pci_dev *vf_d= ev, const void *buf, case ICE_MIG_TLV_MSIX_REGS: err =3D ice_migration_load_msix_regs(vf, vsi, data); break; + case ICE_MIG_TLV_MBX_REGS: + ice_migration_load_mbx_regs(vf, data); + break; + case ICE_MIG_TLV_STATS: + ice_migration_load_stats(vf, vsi, data); + break; + case ICE_MIG_TLV_RSS: + err =3D ice_migration_load_rss(vf, vsi, data); + break; + case ICE_MIG_TLV_VLAN_FILTERS: + err =3D ice_migration_load_vlan_filters(vf, vsi, data); + break; + case ICE_MIG_TLV_MAC_FILTERS: + err =3D ice_migration_load_mac_filters(vf, vsi, data); + break; default: dev_dbg(dev, "Unexpected TLV %d in payload?\n", tlv->type); --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5121626FD97; Tue, 9 Sep 2025 22:00:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455217; cv=none; b=NZOy5cK75/cngFrVyeMCwnlK1xIorlRe2gfa9yvRNf6XJC8swZj5oYTvRpd3BHFI6WWPR8C2/2XuaDQ5wc8V1VTOmPDoEiLEfJ3loKm9+dsg6Hd2/V195TDDpyl7jCVR8B/lFPyelQode4ErjOdUGmgouuFTnq5URvWjl5WNvj4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455217; c=relaxed/simple; bh=IkacwfdlFmsVBWA5lp9RHRN9qNWyVF239IiYhl7cN/A=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Xw9abhe882ZDG47tX+N0D/0waKzvI1nYPJygmUW9MKzgqf7X/ooRBin1T3klw2jy+K+k7AnX2XN8hofhhjpl8Gwr2hQcVaSaLgfXORVOgO0pFi5aYxbeS4J8DV5nWA74Dk96DgaQBSSc5qIT+KqX6Q9IJptQi9376YOtWSj6E2Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=NJbqlac2; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="NJbqlac2" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455214; x=1788991214; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=IkacwfdlFmsVBWA5lp9RHRN9qNWyVF239IiYhl7cN/A=; b=NJbqlac2ijvaEKmWXHyv4XoUg/z45C2a+ypJ3MhQzyp5H4zvMf7JPiKU jw98n30+kuBszcgqo6WHKc50jPM6dNlGkO1E+qRJ/Uz1A9/GciJlKoNWi 9yrqqWcfWiNM/AfTbOQ+Xh1hgSynKnzgGWwqc3dqPpalEsAPuaSCRlb1L Mrt1AqJItFGCxwFVm3/zNKYLo14OyyEYtLaAEYUH+6hkXMIrzeEQk0EKm QI5lTPMEqrs3H8EbgsC8xJ1AcoWoC8QXBGuRhjI+xfr11PF6I22NODvI5 XYYz88pEaZ7vZ4K6aHHPJQIiK2PxYAp8MoOuGj9wItEu62U7qucSZH3Dd g==; X-CSE-ConnectionGUID: qTz99/puTi2xdxYrabm/YQ== X-CSE-MsgGUID: 
ekUZm7huTy+cfsnhsDI4gw== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584674" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584674" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:09 -0700 X-CSE-ConnectionGUID: MSEoKOxQTeeZ4gO7PkzmHg== X-CSE-MsgGUID: tFveJ0vuSka+U/FRA2sJMg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780972" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:55 -0700 Subject: [PATCH RFC net-next 6/7] ice-vfio-pci: add ice VFIO PCI live migration driver Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-6-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=25059; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=IkacwfdlFmsVBWA5lp9RHRN9qNWyVF239IiYhl7cN/A=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi9MUJJ4qCGXdXvTpwKrC5ZueyMYf9NT5/kfYz2xH1 ZNbSz41dpSyMIhxMciKKbIoOISsvG48IUzrjbMczBxWJpAhDFycAjARiQsM/0NKFohfOfxGdeH/ zv1ZVrHxLv93ffsw/YilwYy3cbfOdt9gZDg40cTBL6bqz47zE5L/3BFS432pn560UebT/Uc7Wkr e3GUGAA== X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 Add the ice-vfio-pci driver module which enables live migration support via the vfio_migration_ops for the ice E800 series hardware. To use this module, you can create VFs in the usual way and then unbind them from iavf, and bind them to ice-vfio-pci: echo 2 >/sys/class/net/enp175s0f0np0/device/sriov_numvfs echo "0000:af:01.0" >/sys/bus/pci/drivers/iavf/unbind echo "0000:af:01.1" >/sys/bus/pci/drivers/iavf/unbind modprobe ice_vfio_pci echo "8086 1889" >/sys/bus/pci/drivers/ice-vfio-pci/new_id I've tested with QEMU using the "enable-migration=3Don" and "x-pre-copy-dirty-page-tracking=3Doff" settings, as we do not currently support dirty page tracking. The initial host QEMU instance is launched as usual, while the target QEMU instance is launched with the -incoming tcp:localhost:4444 option. To initiate migration you can issue the migration command from the QEMU console: migrate tcp:localhost:4444 The ice-vfio-pci driver connects to the ice driver using the interface defined in . The migration driver initializes by calling ice_migration_init_dev(). To save device state, the VF is paused using ice_migration_suspend_dev(), and then state is captured by ice_migration_save_devstate(). Some information about the VF must be saved during device suspend, as otherwise the data could be lost when stopping the device. 
For this reason, the ice_migration_suspend_dev() function takes a boolean indicating whether state should be saved. The VFIO migration state machine must suspend the initial device when stopping, but also suspends the target device when resuming. In the resume case, we do not need to save the state, so this can be elided when the VFIO state machine is transitioning to the resuming state. Note that full support is not functional until the PCI .reset_done handler is implemented in a following change. This was split out in order to better callout and explain the locking mechanism due to the complexity required to avoid ABBA locking violations. Signed-off-by: Jacob Keller Reviewed-by: Aleksandr Loktionov --- drivers/vfio/pci/ice/main.c | 699 ++++++++++++++++++++++++++++++++++++++= ++++ MAINTAINERS | 7 + drivers/vfio/pci/Kconfig | 2 + drivers/vfio/pci/Makefile | 2 + drivers/vfio/pci/ice/Kconfig | 8 + drivers/vfio/pci/ice/Makefile | 4 + 6 files changed, 722 insertions(+) diff --git a/drivers/vfio/pci/ice/main.c b/drivers/vfio/pci/ice/main.c new file mode 100644 index 000000000000..161053ba383c --- /dev/null +++ b/drivers/vfio/pci/ice/main.c @@ -0,0 +1,699 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2018-2025 Intel Corporation */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * struct ice_vfio_pci_migration_file - Migration payload file contents + * @filp: the file pointer for communicating with user space + * @lock: mutex protecting the migration file access + * @payload_length: length of the migration payload + * @disabled: if true, the migration file descriptor has been disabled + * @mig_data: buffer holding migration payload + * + * When saving device state, the payload length is calculated ahead of time + * and the buffer is sized appropriately. The receiver sets payload_length= to + * SZ_128K which should be sufficient space for any migration. + */ +struct ice_vfio_pci_migration_file { + struct file *filp; + struct mutex lock; + size_t payload_length; + bool disabled:1; + u8 mig_data[] __counted_by(payload_length); +}; + +/** + * struct ice_vfio_pci_device - Migration driver structure + * @core_device: The core device being operated on + * @mig_info: Migration information + * @state_mutex: mutex protecting the migration state + * @resuming_migf: Migration file containing data for the resuming VF + * @saving_migf: Migration file used to store data from saving VF + * @mig_state: the current migration state of the device + */ +struct ice_vfio_pci_device { + struct vfio_pci_core_device core_device; + struct vfio_device_migration_info mig_info; + struct mutex state_mutex; + struct ice_vfio_pci_migration_file *resuming_migf; + struct ice_vfio_pci_migration_file *saving_migf; + enum vfio_device_mig_state mig_state; +}; + +#define to_ice_vdev(dev) \ + container_of((dev), struct ice_vfio_pci_device, core_device.vdev) + +/** + * ice_vfio_pci_load_state - VFIO device state reloading + * @ice_vdev: pointer to ice-vfio-pci core device structure + * + * Load device state. This function is called when the userspace VFIO uAPI + * consumer wants to load the device state info from VFIO migration region= and + * load them into the device. This function should make sure all the device + * state info is loaded successfully. As a result, return value is mandato= ry + * to be checked. + * + * Return: 0 on success, or a negative error code on failure. 
+ */ +static int __must_check +ice_vfio_pci_load_state(struct ice_vfio_pci_device *ice_vdev) +{ + struct ice_vfio_pci_migration_file *migf =3D ice_vdev->resuming_migf; + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + + return ice_migration_load_devstate(pdev, + migf->mig_data, + migf->payload_length); +} + +/** + * ice_vfio_pci_save_state - VFIO device state saving + * @ice_vdev: pointer to ice-vfio-pci core device structure + * @migf: pointer to migration file + * + * Snapshot the device state and save it. This function is called when the + * VFIO uAPI consumer wants to snapshot the current device state and saves + * it into the VFIO migration region. This function should make sure all + * of the device state info is collected and saved successfully. As a + * result, return value is mandatory to be checked. + * + * Return: 0 on success, or a negative error code on failure. + */ +static int __must_check +ice_vfio_pci_save_state(struct ice_vfio_pci_device *ice_vdev, + struct ice_vfio_pci_migration_file *migf) +{ + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + + return ice_migration_save_devstate(pdev, + migf->mig_data, + migf->payload_length); +} + +/** + * ice_vfio_migration_init - Initialization for live migration function + * @ice_vdev: pointer to ice-vfio-pci core device structure + * + * Return: 0 on success, or a negative error code on failure. + */ +static int ice_vfio_migration_init(struct ice_vfio_pci_device *ice_vdev) +{ + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + + return ice_migration_init_dev(pdev); +} + +/** + * ice_vfio_migration_uninit - Cleanup for live migration function + * @ice_vdev: pointer to ice-vfio-pci core device structure + */ +static void ice_vfio_migration_uninit(struct ice_vfio_pci_device *ice_vdev) +{ + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + + ice_migration_uninit_dev(pdev); +} + +/** + * ice_vfio_pci_disable_fd - Close migration file + * @migf: pointer to ice-vfio-pci migration file + */ +static void ice_vfio_pci_disable_fd(struct ice_vfio_pci_migration_file *mi= gf) +{ + guard(mutex)(&migf->lock); + + migf->disabled =3D true; + migf->payload_length =3D 0; + migf->filp->f_pos =3D 0; +} + +/** + * ice_vfio_pci_disable_fds - Close migration files of ice-vfio-pci device + * @ice_vdev: pointer to ice-vfio-pci core device structure + */ +static void ice_vfio_pci_disable_fds(struct ice_vfio_pci_device *ice_vdev) +{ + if (ice_vdev->resuming_migf) { + ice_vfio_pci_disable_fd(ice_vdev->resuming_migf); + fput(ice_vdev->resuming_migf->filp); + ice_vdev->resuming_migf =3D NULL; + } + if (ice_vdev->saving_migf) { + ice_vfio_pci_disable_fd(ice_vdev->saving_migf); + fput(ice_vdev->saving_migf->filp); + ice_vdev->saving_migf =3D NULL; + } +} + +/** + * ice_vfio_pci_open_device - VFIO .open_device callback + * @vdev: the VFIO device to open + * + * Called when a VFIO device is probed by VFIO uAPI. Initializes the VFIO + * device and sets up the migration state. + * + * Return: 0 on success, or a negative error code on failure. 
+ */ +static int ice_vfio_pci_open_device(struct vfio_device *vdev) +{ + struct vfio_pci_core_device *core_vdev; + struct ice_vfio_pci_device *ice_vdev; + int ret; + + ice_vdev =3D to_ice_vdev(vdev); + core_vdev =3D &ice_vdev->core_device; + + ret =3D vfio_pci_core_enable(core_vdev); + if (ret) + return ret; + + ret =3D ice_vfio_migration_init(ice_vdev); + if (ret) { + vfio_pci_core_disable(core_vdev); + return ret; + } + ice_vdev->mig_state =3D VFIO_DEVICE_STATE_RUNNING; + vfio_pci_core_finish_enable(core_vdev); + + return 0; +} + +/** + * ice_vfio_pci_close_device - VFIO .close_device callback + * @vdev: the VFIO device to close + * + * Called when a VFIO device fd is closed. Destroys migration state. + */ +static void ice_vfio_pci_close_device(struct vfio_device *vdev) +{ + struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); + + ice_vfio_pci_disable_fds(ice_vdev); + vfio_pci_core_close_device(vdev); + ice_vfio_migration_uninit(ice_vdev); +} + +/** + * ice_vfio_pci_release_file - Release ice-vfio-pci migration file + * @inode: pointer to inode + * @filp: pointer to the file to release + * + * Return: 0. + */ +static int ice_vfio_pci_release_file(struct inode *inode, struct file *fil= p) +{ + struct ice_vfio_pci_migration_file *migf =3D filp->private_data; + + ice_vfio_pci_disable_fd(migf); + mutex_destroy(&migf->lock); + vfree(migf); + return 0; +} + +/** + * ice_vfio_pci_save_read - Save migration file data to user space + * @filp: pointer to migration file + * @buf: pointer to user space buffer + * @len: data length to be saved + * @pos: must be 0 + * + * Return: length of the saved data, or a negative error code on failure. + */ +static ssize_t ice_vfio_pci_save_read(struct file *filp, char __user *buf, + size_t len, loff_t *pos) +{ + struct ice_vfio_pci_migration_file *migf =3D filp->private_data; + loff_t *off =3D &filp->f_pos; + ssize_t done =3D 0; + int ret; + + if (pos) + return -ESPIPE; + + guard(mutex)(&migf->lock); + + if (*off > migf->payload_length) + return -EINVAL; + + if (migf->disabled) + return -ENODEV; + + len =3D min_t(size_t, migf->payload_length - *off, len); + if (len) { + ret =3D copy_to_user(buf, migf->mig_data + *off, len); + if (ret) + return -EFAULT; + + *off +=3D len; + done =3D len; + } + + return done; +} + +static const struct file_operations ice_vfio_pci_save_fops =3D { + .owner =3D THIS_MODULE, + .read =3D ice_vfio_pci_save_read, + .release =3D ice_vfio_pci_release_file, +}; + +/** + * ice_vfio_pci_stop_copy - Create migration file and save migration state + * @ice_vdev: pointer to ice-vfio-pci core device structure + * + * Return: pointer to the migration file structure, or an error pointer on + * failure. 
+ */ +static struct ice_vfio_pci_migration_file * +ice_vfio_pci_stop_copy(struct ice_vfio_pci_device *ice_vdev) +{ + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + struct ice_vfio_pci_migration_file *migf; + size_t payload_length; + int ret; + + payload_length =3D ice_migration_get_required_size(pdev); + if (!payload_length) + return ERR_PTR(-EIO); + + migf =3D vzalloc(struct_size(migf, mig_data, payload_length)); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->payload_length =3D payload_length; + migf->filp =3D anon_inode_getfile("ice_vfio_pci_mig", + &ice_vfio_pci_save_fops, migf, + O_RDONLY); + if (IS_ERR(migf->filp)) { + ret =3D PTR_ERR(migf->filp); + goto err_free_migf; + } + + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + + ret =3D ice_vfio_pci_save_state(ice_vdev, migf); + if (ret) + goto err_put_migf_filp; + + return migf; + +err_put_migf_filp: + fput(migf->filp); +err_free_migf: + vfree(migf); + + return ERR_PTR(ret); +} + +/** + * ice_vfio_pci_resume_write - Copy migration file data from user space + * @filp: pointer to migration file + * @buf: pointer to user space buffer + * @len: data length to be copied + * @pos: must be 0 + * + * Return: length of the data copied, or a negative error code on failure. + */ +static ssize_t ice_vfio_pci_resume_write(struct file *filp, + const char __user *buf, + size_t len, loff_t *pos) +{ + struct ice_vfio_pci_migration_file *migf =3D filp->private_data; + loff_t *off =3D &filp->f_pos; + loff_t requested_length; + ssize_t done =3D 0; + int ret; + + if (pos) + return -ESPIPE; + + if (*off < 0 || + check_add_overflow((loff_t)len, *off, &requested_length)) + return -EINVAL; + + if (requested_length > migf->payload_length) + return -ENOMEM; + + guard(mutex)(&migf->lock); + + if (migf->disabled) + return -ENODEV; + + ret =3D copy_from_user(migf->mig_data + *off, buf, len); + if (ret) + return -EFAULT; + + *off +=3D len; + done =3D len; + migf->payload_length +=3D len; + + return done; +} + +static const struct file_operations ice_vfio_pci_resume_fops =3D { + .owner =3D THIS_MODULE, + .write =3D ice_vfio_pci_resume_write, + .release =3D ice_vfio_pci_release_file, +}; + +/** + * ice_vfio_pci_resume - Create resuming migration file + * @ice_vdev: pointer to ice-vfio-pci core device structure + * + * Return: pointer to the migration file handler, or an error pointer on + * failure. + */ +static struct ice_vfio_pci_migration_file * +ice_vfio_pci_resume(struct ice_vfio_pci_device *ice_vdev) +{ + struct ice_vfio_pci_migration_file *migf; + + migf =3D vzalloc(struct_size(migf, mig_data, SZ_128K)); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->payload_length =3D SZ_128K; + migf->filp =3D anon_inode_getfile("ice_vfio_pci_mig", + &ice_vfio_pci_resume_fops, migf, + O_WRONLY); + if (IS_ERR(migf->filp)) { + int ret =3D PTR_ERR(migf->filp); + + vfree(migf); + + return ERR_PTR(ret); + } + + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + + return migf; +} + +/** + * ice_vfio_pci_step_device_state_locked - Process device state change + * @ice_vdev: pointer to ice-vfio-pci core device structure + * @new: new device state + * @final: final device state + * + * Return: pointer to the migration file handler or NULL on success, or an + * error pointer on failure. 
+ */ +static struct file * +ice_vfio_pci_step_device_state_locked(struct ice_vfio_pci_device *ice_vdev, + enum vfio_device_mig_state new, + enum vfio_device_mig_state final) +{ + enum vfio_device_mig_state cur =3D ice_vdev->mig_state; + struct pci_dev *pdev =3D ice_vdev->core_device.pdev; + int ret; + + if (cur =3D=3D VFIO_DEVICE_STATE_RUNNING && new =3D=3D VFIO_DEVICE_STATE_= RUNNING_P2P) { + bool save_state =3D final !=3D VFIO_DEVICE_STATE_RESUMING; + + /* The ice driver needs to keep track of some state which + * would otherwise be lost when suspending the device. It only + * needs this state if the device later transitions into + * STOP_COPY, which copies the device state for migration. + * The transition from RUNNING to RUNNING_P2P also occurs as + * part of transitioning into the RESUME state. + * + * Avoid saving the device state if its known to be + * unnecessary. + */ + ice_migration_suspend_dev(pdev, save_state); + return NULL; + } + + if (cur =3D=3D VFIO_DEVICE_STATE_RUNNING_P2P && new =3D=3D VFIO_DEVICE_ST= ATE_STOP) + return NULL; + + if (cur =3D=3D VFIO_DEVICE_STATE_STOP && new =3D=3D VFIO_DEVICE_STATE_STO= P_COPY) { + struct ice_vfio_pci_migration_file *migf; + + migf =3D ice_vfio_pci_stop_copy(ice_vdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + ice_vdev->saving_migf =3D migf; + return migf->filp; + } + + if (cur =3D=3D VFIO_DEVICE_STATE_STOP_COPY && new =3D=3D VFIO_DEVICE_STAT= E_STOP) { + ice_vfio_pci_disable_fds(ice_vdev); + return NULL; + } + + if (cur =3D=3D VFIO_DEVICE_STATE_STOP && new =3D=3D VFIO_DEVICE_STATE_RES= UMING) { + struct ice_vfio_pci_migration_file *migf; + + migf =3D ice_vfio_pci_resume(ice_vdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + ice_vdev->resuming_migf =3D migf; + return migf->filp; + } + + if (cur =3D=3D VFIO_DEVICE_STATE_RESUMING && new =3D=3D VFIO_DEVICE_STATE= _STOP) + return NULL; + + if (cur =3D=3D VFIO_DEVICE_STATE_STOP && new =3D=3D VFIO_DEVICE_STATE_RUN= NING_P2P) { + ret =3D ice_vfio_pci_load_state(ice_vdev); + if (ret) + return ERR_PTR(ret); + ice_vfio_pci_disable_fds(ice_vdev); + return NULL; + } + + if (cur =3D=3D VFIO_DEVICE_STATE_RUNNING_P2P && new =3D=3D VFIO_DEVICE_ST= ATE_RUNNING) + return NULL; + + /* + * vfio_mig_get_next_state() does not use arcs other than the above + */ + WARN_ON(true); + return ERR_PTR(-EINVAL); +} + +/** + * ice_vfio_pci_set_device_state - Configure the device state + * @vdev: pointer to VFIO device + * @new_state: device state + * + * Return: 0 on success, or a negative error code on failure. 
+ */ +static struct file * +ice_vfio_pci_set_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state new_state) +{ + struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); + enum vfio_device_mig_state next_state; + struct file *res =3D NULL; + int ret; + + mutex_lock(&ice_vdev->state_mutex); + + while (new_state !=3D ice_vdev->mig_state) { + ret =3D vfio_mig_get_next_state(vdev, ice_vdev->mig_state, + new_state, &next_state); + if (ret) { + res =3D ERR_PTR(ret); + break; + } + + res =3D ice_vfio_pci_step_device_state_locked(ice_vdev, + next_state, + new_state); + if (IS_ERR(res)) + break; + + ice_vdev->mig_state =3D next_state; + if (WARN_ON(res && new_state !=3D ice_vdev->mig_state)) { + fput(res); + res =3D ERR_PTR(-EINVAL); + break; + } + } + + mutex_unlock(&ice_vdev->state_mutex); + + return res; +} + +/** + * ice_vfio_pci_get_device_state - Get device state + * @vdev: pointer to VFIO device + * @curr_state: device state + * + * Return: 0. + */ +static int ice_vfio_pci_get_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state *curr_state) +{ + struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); + + mutex_lock(&ice_vdev->state_mutex); + + *curr_state =3D ice_vdev->mig_state; + + mutex_unlock(&ice_vdev->state_mutex); + + return 0; +} + +/** + * ice_vfio_pci_get_data_size - Get estimated migration data size + * @vdev: pointer to VFIO device + * @stop_copy_length: migration data size + * + * Return: 0. + */ +static int ice_vfio_pci_get_data_size(struct vfio_device *vdev, + unsigned long *stop_copy_length) +{ + *stop_copy_length =3D SZ_128K; + return 0; +} + +static const struct vfio_migration_ops ice_vfio_pci_migrn_state_ops =3D { + .migration_set_state =3D ice_vfio_pci_set_device_state, + .migration_get_state =3D ice_vfio_pci_get_device_state, + .migration_get_data_size =3D ice_vfio_pci_get_data_size, +}; + +/** + * ice_vfio_pci_core_init_dev - initialize VFIO device + * @vdev: pointer to VFIO device + * + * Return: 0. 
+ */ +static int ice_vfio_pci_core_init_dev(struct vfio_device *vdev) +{ + struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); + + mutex_init(&ice_vdev->state_mutex); + + vdev->migration_flags =3D + VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P; + vdev->mig_ops =3D &ice_vfio_pci_migrn_state_ops; + + return vfio_pci_core_init_dev(vdev); +} + +/** + * ice_vfio_pci_core_release_dev - Release VFIO device + * @vdev: pointer to VFIO device + */ +static void ice_vfio_pci_core_release_dev(struct vfio_device *vdev) +{ + struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); + + mutex_destroy(&ice_vdev->state_mutex); + + vfio_pci_core_release_dev(vdev); +} + +static const struct vfio_device_ops ice_vfio_pci_ops =3D { + .name =3D "ice-vfio-pci", + .init =3D ice_vfio_pci_core_init_dev, + .release =3D ice_vfio_pci_core_release_dev, + .open_device =3D ice_vfio_pci_open_device, + .close_device =3D ice_vfio_pci_close_device, + .device_feature =3D vfio_pci_core_ioctl_feature, + .read =3D vfio_pci_core_read, + .write =3D vfio_pci_core_write, + .ioctl =3D vfio_pci_core_ioctl, + .mmap =3D vfio_pci_core_mmap, + .request =3D vfio_pci_core_request, + .match =3D vfio_pci_core_match, + .bind_iommufd =3D vfio_iommufd_physical_bind, + .unbind_iommufd =3D vfio_iommufd_physical_unbind, + .attach_ioas =3D vfio_iommufd_physical_attach_ioas, + .detach_ioas =3D vfio_iommufd_physical_detach_ioas, +}; + +/** + * ice_vfio_pci_probe - Device initialization routine + * @pdev: PCI device information struct + * @id: entry in ice_vfio_pci_table + * + * Return: 0 on success, or a negative error code on failure. + */ +static int +ice_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) +{ + struct ice_vfio_pci_device *ice_vdev; + int ret; + + ice_vdev =3D vfio_alloc_device(ice_vfio_pci_device, core_device.vdev, + &pdev->dev, &ice_vfio_pci_ops); + if (!ice_vdev) + return -ENOMEM; + + dev_set_drvdata(&pdev->dev, &ice_vdev->core_device); + + ret =3D vfio_pci_core_register_device(&ice_vdev->core_device); + if (ret) + goto out_free; + + return 0; + +out_free: + vfio_put_device(&ice_vdev->core_device.vdev); + return ret; +} + +/** + * ice_vfio_pci_remove - Device removal routine + * @pdev: PCI device information struct + */ +static void ice_vfio_pci_remove(struct pci_dev *pdev) +{ + struct ice_vfio_pci_device *ice_vdev =3D dev_get_drvdata(&pdev->dev); + + vfio_pci_core_unregister_device(&ice_vdev->core_device); + vfio_put_device(&ice_vdev->core_device.vdev); +} + +/* ice_pci_tbl - PCI Device ID Table + * + * Wildcard entries (PCI_ANY_ID) should come last + * Last entry must be all 0s + * + * { Vendor ID, Device ID, SubVendor ID, SubDevice ID, + * Class, Class Mask, private data (not used) } + */ +static const struct pci_device_id ice_vfio_pci_table[] =3D { + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0x1889) }, + {} +}; +MODULE_DEVICE_TABLE(pci, ice_vfio_pci_table); + +static const struct pci_error_handlers ice_vfio_pci_core_err_handlers =3D { + .error_detected =3D vfio_pci_core_aer_err_detected, +}; + +static struct pci_driver ice_vfio_pci_driver =3D { + .name =3D "ice-vfio-pci", + .id_table =3D ice_vfio_pci_table, + .probe =3D ice_vfio_pci_probe, + .remove =3D ice_vfio_pci_remove, + .err_handler =3D &ice_vfio_pci_core_err_handlers, + .driver_managed_dma =3D true, +}; + +module_pci_driver(ice_vfio_pci_driver); + +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("ICE VFIO PCI - User Level meta-driver for Intel E800 d= evice family"); diff --git a/MAINTAINERS b/MAINTAINERS index 
b81595e9ea95..512808140ebe 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -26477,6 +26477,13 @@ L: kvm@vger.kernel.org S: Maintained F: drivers/vfio/pci/hisilicon/ =20 +VFIO ICE PCI DRIVER +M: Jacob Keller +L: kvm@vger.kernel.org +S: Supported +F: drivers/vfio/pci/ice/ +F: include/linux/net/intel/ice_migration.h + VFIO MEDIATED DEVICE DRIVERS M: Kirti Wankhede L: kvm@vger.kernel.org diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 2b0172f54665..74e0fb571936 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -67,4 +67,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig" =20 source "drivers/vfio/pci/qat/Kconfig" =20 +source "drivers/vfio/pci/ice/Kconfig" + endmenu diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index cf00c0a7e55c..721b01ad3a2e 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -19,3 +19,5 @@ obj-$(CONFIG_VIRTIO_VFIO_PCI) +=3D virtio/ obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) +=3D nvgrace-gpu/ =20 obj-$(CONFIG_QAT_VFIO_PCI) +=3D qat/ + +obj-$(CONFIG_ICE_VFIO_PCI) +=3D ice/ diff --git a/drivers/vfio/pci/ice/Kconfig b/drivers/vfio/pci/ice/Kconfig new file mode 100644 index 000000000000..3e8f5d6e60dc --- /dev/null +++ b/drivers/vfio/pci/ice/Kconfig @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0-only +config ICE_VFIO_PCI + tristate "VFIO support for Intel(R) Ethernet Connection E800 Series" + depends on ICE + select VFIO_PCI_CORE + help + This provides migration support for Intel(R) Ethernet connection E800 + series devices using the VFIO framework. diff --git a/drivers/vfio/pci/ice/Makefile b/drivers/vfio/pci/ice/Makefile new file mode 100644 index 000000000000..5b8df8234b31 --- /dev/null +++ b/drivers/vfio/pci/ice/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_ICE_VFIO_PCI) +=3D ice-vfio-pci.o +ice-vfio-pci-y :=3D main.o + --=20 2.51.0.rc1.197.g6d975e95c9d7 From nobody Thu Oct 2 22:52:47 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 112672773D8; Tue, 9 Sep 2025 22:00:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455218; cv=none; b=fcAfqmzywMD4X9q8cpogsVU5vWbIkOr2GlJiaFxjNtAzOuRe5biZ5h5fCynBVhqCUcKWgMD3qf8UeI9LOIKryoGP2N1e0Q82G2HRxQe0Z5BKaT5KRyJriV7o0YUJEWdUmsque98J7LubMlPZn2a4Him/abbUnoH0C0R6psHqNBg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757455218; c=relaxed/simple; bh=J0wPZupB+Xsv5pSL6FaS6vYM6IU4qCEKiECUNs+CFV8=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=UmGmIVoWEepLO5+ufkP+4gr3G9SyNlExbh8Lr1dSWqziuxGCnfxJ62o018ty+P2yUvEFxbmsPzji4I55zzeKLskbnNw2vpVmXuECK287i+rGi8XGD2P3YjA2kRuumE96rlaBAYKIjpTI2eSJ3r7IzVcX7fkKeV6wXpwvKsPVYoI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=URq2Fxdk; arc=none smtp.client-ip=192.198.163.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com 
header.i=@intel.com header.b="URq2Fxdk" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1757455216; x=1788991216; h=from:date:subject:mime-version:content-transfer-encoding: message-id:references:in-reply-to:to:cc; bh=J0wPZupB+Xsv5pSL6FaS6vYM6IU4qCEKiECUNs+CFV8=; b=URq2FxdkmtCq+yCguz2LNwoGaIeDzNW4KutB+2/bpYgJlJhTIDv91C5i Okdw/HnY7c1AA6QtOlQFBuQoJ50s2DZs0CeZ8PNouIlYz/RqKOIshgUw5 bFYdMU5Ek0zVms0iPVMbn/YTzVPuedFJl35rNh3QjuT5GMIsqfPgcWEwA ZydXe0bfrNxzlwWkMuNNBtPF7mwdFYdmXLaKkiYcP75wtuNbimtRLs+34 QSIz/2RFjtVRNaFwYXNX08UVM+Y5YrYixGUFkoyFI2Rcauo+cNub1o3IO IztcViUG2xvvrXfBjtqaAJjdnbuGJ0xGeW0oesopQ76zBtOfgsTtaVwQD g==; X-CSE-ConnectionGUID: rbSETMh3RkyskaPdPpAaCg== X-CSE-MsgGUID: je0gPVssQ+Ku9Ey8uAy3Tw== X-IronPort-AV: E=McAfee;i="6800,10657,11548"; a="63584676" X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="63584676" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by fmvoesa106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:09 -0700 X-CSE-ConnectionGUID: HNmsrvI3SqKaxlf3KYSRpQ== X-CSE-MsgGUID: 79gh8YjbTLSpo0j/yzml+w== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.18,252,1751266800"; d="scan'208";a="172780974" Received: from orcnseosdtjek.jf.intel.com (HELO [10.166.28.70]) ([10.166.28.70]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Sep 2025 15:00:08 -0700 From: Jacob Keller Date: Tue, 09 Sep 2025 14:57:56 -0700 Subject: [PATCH RFC net-next 7/7] ice-vfio-pci: implement PCI .reset_done handling Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20250909-e810-live-migration-jk-migration-tlv-v1-7-4d1dc641e31f@intel.com> References: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> In-Reply-To: <20250909-e810-live-migration-jk-migration-tlv-v1-0-4d1dc641e31f@intel.com> To: Tony Nguyen , Przemek Kitszel , Andrew Lunn , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Kees Cook , "Gustavo A. R. Silva" , Alex Williamson , Jason Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Jakub Kicinski Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Jacob Keller , Aleksandr Loktionov X-Mailer: b4 0.15-dev-c61db X-Developer-Signature: v=1; a=openpgp-sha256; l=6535; i=jacob.e.keller@intel.com; h=from:subject:message-id; bh=J0wPZupB+Xsv5pSL6FaS6vYM6IU4qCEKiECUNs+CFV8=; b=owGbwMvMwCWWNS3WLp9f4wXjabUkhowDi9Oc9l7cG7r7jd8j8xgrg+3CfcsaPnhv3DqJc7r7z oL80LXXOkpZGMS4GGTFFFkUHEJWXjeeEKb1xlkOZg4rE8gQBi5OAZjIvYeMDJcYun+8sgx+58h2 21lMLuaosZZXb36F35mIWWcXb4/iu8nIsMPz/pObfbdq4/WrO6/su6UT6hkY1P9YTX7O44a0/zd S2AA= X-Developer-Key: i=jacob.e.keller@intel.com; a=openpgp; fpr=204054A9D73390562AEC431E6A965D3E6F0F28E8 Add an implementation callback for the PCI .reset_done handler to enable cleanup after a PCI reset. This function has one rather nasty locking complexity due to the way the various locks interact. The VFIO layer holds the mm_lock across a reset, and a naive implementation which just takes the state mutex would trigger a simple ABBA deadlock between the state_mutex and the mm_lock. To avoid this, allow deferring handling cleanup after a PCI reset until the current thread holding the state_mutex exits. This is done through adding a reset_lock spinlock and a needs_reset boolean. 
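In rough outline the handshake looks like the following simplified,
hypothetical sketch (names shortened for brevity; the real
ice_vfio_pci_state_mutex_unlock() and ice_vfio_pci_reset_done()
implementations are in the diff below):

  static void state_mutex_unlock(struct ice_vfio_pci_device *d)
  {
  again:
  	spin_lock(&d->reset_lock);
  	if (d->needs_reset) {
  		d->needs_reset =3D false;
  		spin_unlock(&d->reset_lock);
  		/* deferred cleanup runs outside reset_lock */
  		d->mig_state =3D VFIO_DEVICE_STATE_RUNNING;
  		ice_vfio_pci_disable_fds(d);
  		goto again;
  	}
  	/* release state_mutex before reset_lock so a new reset cannot
  	 * be deferred in between without being noticed
  	 */
  	mutex_unlock(&d->state_mutex);
  	spin_unlock(&d->reset_lock);
  }

  static void reset_done(struct pci_dev *pdev)
  {
  	struct ice_vfio_pci_device *d =3D dev_get_drvdata(&pdev->dev);

  	spin_lock(&d->reset_lock);
  	d->needs_reset =3D true;
  	if (!mutex_trylock(&d->state_mutex)) {
  		/* another thread owns state_mutex and will perform the
  		 * deferred cleanup when it unlocks
  		 */
  		spin_unlock(&d->reset_lock);
  		return;
  	}
  	spin_unlock(&d->reset_lock);
  	state_mutex_unlock(d);
  }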
All flows which previously simply released the state_mutex now call a specialized ice_vfio_pci_state_mutex_unlock() handler. This handler acquires the reset_lock, and checks if a reset was deferred. If so, the reset_lock is released, cleanup is handled, then the reset_lock is reacquired and the thread loops to check for another deferred reset. Eventually the needs_reset is false, and the function exits by releasing the state_mutex and then the deferred reset_lock. The actual reset_done implementation acquires the reset lock, sets needs_reset to true, then uses try_lock to acquire the state mutex. If it fails to acquire the state mutex, this means another thread is handling business and will perform the deferred reset cleanup as part of unlocking the state mutex. Finally, if the reset_done does acquire the state mutex, it simply unlocks using the ice_vfio_pci_state_mutex_unlock helper which will immediately handle the "deferred" reset. This is complicated, but is similar to code in other VFIO migration drivers including the mlx5 driver and logic in the virtiovf migration code. Signed-off-by: Jacob Keller --- drivers/vfio/pci/ice/main.c | 69 +++++++++++++++++++++++++++++++++++++++++= ++-- 1 file changed, 67 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/pci/ice/main.c b/drivers/vfio/pci/ice/main.c index 161053ba383c..17865fab02ce 100644 --- a/drivers/vfio/pci/ice/main.c +++ b/drivers/vfio/pci/ice/main.c @@ -36,17 +36,21 @@ struct ice_vfio_pci_migration_file { * @core_device: The core device being operated on * @mig_info: Migration information * @state_mutex: mutex protecting the migration state + * @reset_lock: spinlock protecting the reset_done flow * @resuming_migf: Migration file containing data for the resuming VF * @saving_migf: Migration file used to store data from saving VF * @mig_state: the current migration state of the device + * @needs_reset: if true, reset is required at next unlock of state_mutex */ struct ice_vfio_pci_device { struct vfio_pci_core_device core_device; struct vfio_device_migration_info mig_info; struct mutex state_mutex; + spinlock_t reset_lock; struct ice_vfio_pci_migration_file *resuming_migf; struct ice_vfio_pci_migration_file *saving_migf; enum vfio_device_mig_state mig_state; + bool needs_reset:1; }; =20 #define to_ice_vdev(dev) \ @@ -154,6 +158,65 @@ static void ice_vfio_pci_disable_fds(struct ice_vfio_p= ci_device *ice_vdev) } } =20 +/** + * ice_vfio_pci_state_mutex_unlock - Unlock state_mutex + * @mutex: pointer to the ice-vfio-pci state_mutex + * + * ice_vfio_pci_reset_done may defer a reset in the event it fails to acqu= ire + * the state_mutex. This is necessary in order to avoid an unconditional + * acquire of the state_mutex that could lead to ABBA lock inversion issues + * with the mm lock. + * + * This function is called to unlock the state_mutex, but ensures that any + * deferred reset is handled prior to unlocking. It uses the reset_lock to + * check if any reset has been deferred. + */ +static void ice_vfio_pci_state_mutex_unlock(struct mutex *mutex) +{ + struct ice_vfio_pci_device *ice_vdev =3D + container_of(mutex, struct ice_vfio_pci_device, + state_mutex); + +again: + spin_lock(&ice_vdev->reset_lock); + if (ice_vdev->needs_reset) { + ice_vdev->needs_reset =3D false; + spin_unlock(&ice_vdev->reset_lock); + ice_vdev->mig_state =3D VFIO_DEVICE_STATE_RUNNING; + ice_vfio_pci_disable_fds(ice_vdev); + goto again; + } + /* The state_mutex must be unlocked before the reset_lock, otherwise + * a new deferred reset could occur inbetween. 
Such a reset then be + * deferred until the next state_mutex critical section. + */ + mutex_unlock(&ice_vdev->state_mutex); + spin_unlock(&ice_vdev->reset_lock); +} + +/** + * ice_vfio_pci_reset_done - Handle or defer PCI reset + * @pdev: The PCI device structure + * + * As the higher VFIO layers are holding locks across reset and using those + * same locks with the mm_lock we need to prevent ABBA deadlock with the + * state_mutex and mm_lock. In case the state_mutex was taken already we d= efer + * the cleanup work to the unlock flow of the other running context. + */ +static void ice_vfio_pci_reset_done(struct pci_dev *pdev) +{ + struct ice_vfio_pci_device *ice_vdev =3D dev_get_drvdata(&pdev->dev); + + spin_lock(&ice_vdev->reset_lock); + ice_vdev->needs_reset =3D true; + if (!mutex_trylock(&ice_vdev->state_mutex)) { + spin_unlock(&ice_vdev->reset_lock); + return; + } + spin_unlock(&ice_vdev->reset_lock); + ice_vfio_pci_state_mutex_unlock(&ice_vdev->state_mutex); +} + /** * ice_vfio_pci_open_device - VFIO .open_device callback * @vdev: the VFIO device to open @@ -526,7 +589,7 @@ ice_vfio_pci_set_device_state(struct vfio_device *vdev, } } =20 - mutex_unlock(&ice_vdev->state_mutex); + ice_vfio_pci_state_mutex_unlock(&ice_vdev->state_mutex); =20 return res; } @@ -547,7 +610,7 @@ static int ice_vfio_pci_get_device_state(struct vfio_de= vice *vdev, =20 *curr_state =3D ice_vdev->mig_state; =20 - mutex_unlock(&ice_vdev->state_mutex); + ice_vfio_pci_state_mutex_unlock(&ice_vdev->state_mutex); =20 return 0; } @@ -583,6 +646,7 @@ static int ice_vfio_pci_core_init_dev(struct vfio_devic= e *vdev) struct ice_vfio_pci_device *ice_vdev =3D to_ice_vdev(vdev); =20 mutex_init(&ice_vdev->state_mutex); + spin_lock_init(&ice_vdev->reset_lock); =20 vdev->migration_flags =3D VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P; @@ -681,6 +745,7 @@ static const struct pci_device_id ice_vfio_pci_table[] = =3D { MODULE_DEVICE_TABLE(pci, ice_vfio_pci_table); =20 static const struct pci_error_handlers ice_vfio_pci_core_err_handlers =3D { + .reset_done =3D ice_vfio_pci_reset_done, .error_detected =3D vfio_pci_core_aer_err_detected, }; =20 --=20 2.51.0.rc1.197.g6d975e95c9d7