The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
1 file changed, 278 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/nvme.txt
diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, the NVMe device provides its own vendor-specific
+implementation: five new admin commands and a capability flag in the
+vendor-specific field of the identify controller data structure. Software
+uses these admin commands to query the size of a VF's migration state data,
+to save and load that data, and to suspend and resume a given VF device.
+The commands are submitted to the NVMe PF device's admin queue and are
+ignored if placed in a VF device's admin queue, because in a virtualization
+scenario the VF is passed through to the virtual machine, so its admin
+queue is not available to the hypervisor for submitting live migration
+commands. The capability flag in the identify controller data structure
+lets software detect whether the NVMe device supports live migration. The
+following chapters describe the detailed format of the commands and the
+capability flag.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+| | | | |
+| Opcode by Field | | | |
+| | | | |
++--------+---------+--------+ | | |
+| | | | Combined | Namespace | |
+| 07 | 06:02 | 01:00 | Opcode | Identifier| Command |
+| | | | | used | |
++--------+---------+--------+ | | |
+|Generic | Function| Data | | | |
+|command | |Transfer| | | |
++--------+---------+--------+-----------+-----------+------------+
+| |
+|                     Vendor Specific Opcode                     |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Query the |
+| 1b | 10001 | 00 | 0xC4 | | data size |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Suspend the|
+| 1b | 10010 | 00 | 0xC8 | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Resume the |
+| 1b | 10011 | 00 | 0xCC | | VF |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Save the |
+| 1b | 10100 | 10 | 0xD2 | |device data |
++--------+---------+--------+-----------+-----------+------------+
+| | | | | | Load the |
+| 1b | 10101 | 01 | 0xD5 | |device data |
++--------+---------+--------+-----------+-----------+------------+
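+
+As an illustration, a host-side driver could encode these vendor-specific
+opcodes as shown in the following minimal sketch (the enum and its names are
+hypothetical and not taken from this patch set)::
+
+    enum nvme_lm_admin_opcode {
+        nvme_admin_lm_query_data_size = 0xC4, /* bits 1:0 = 00b, no data transfer   */
+        nvme_admin_lm_suspend         = 0xC8, /* bits 1:0 = 00b, no data transfer   */
+        nvme_admin_lm_resume          = 0xCC, /* bits 1:0 = 00b, no data transfer   */
+        nvme_admin_lm_save_data       = 0xD2, /* bits 1:0 = 10b, controller-to-host */
+        nvme_admin_lm_load_data       = 0xD5, /* bits 1:0 = 01b, host-to-controller */
+    };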
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+|         |  |  07:00    |Opcode(OPC):set to 0xC4 to indicate a query command                | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for more details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data size to query |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the size of the NVMe VF internal data for live
+migration. When the NVMe firmware receives this command, it returns the size of the VF internal
+data, which depends on how many IO queues have been created.
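+
+A minimal C view of the 64-byte submission queue entry shared by
+QUERY_DATA_SIZE, SUSPEND and RESUME (only the opcode differs) could look like
+the sketch below; the structure and field names are illustrative and are not
+taken from the driver sources::
+
+    #include <linux/types.h>
+
+    struct nvme_lm_vf_cmd {
+        __u8   opcode;      /* byte  00    : OPC, e.g. 0xC4 for QUERY_DATA_SIZE */
+        __u8   flags;       /* byte  01    : FUSE in bits 1:0, PSDT in bits 7:6 */
+        __le16 command_id;  /* bytes 03:02 : CID                                */
+        __u8   rsvd4[36];   /* bytes 39:04 : reserved                           */
+        __le16 vf_index;    /* bytes 41:40 : which VF controller is targeted    */
+        __u8   rsvd42[22];  /* bytes 63:42 : reserved                           */
+    };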
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xC8 to indicate a suspend command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe specification for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to suspend |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller selected by the VF index. When the
+NVMe firmware receives this command, it suspends that VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xCC to indicate a resume command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 39:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller to resume |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller selected by the VF index. When the
+firmware receives this command, it restarts that VF controller.
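+
+For illustration, suspending or resuming a VF from the PF could look roughly
+like the sketch below. It assumes access to the NVMe host driver's internal
+helpers (struct nvme_ctrl and nvme_submit_sync_cmd() from
+drivers/nvme/host/nvme.h); the wrapper itself is hypothetical and is not part
+of this series::
+
+    static int nvme_lm_set_vf_state(struct nvme_ctrl *pf_ctrl, u8 opcode,
+                                    u16 vf_index)
+    {
+        struct nvme_command c = { };
+
+        c.common.opcode = opcode;                /* 0xC8 = SUSPEND, 0xCC = RESUME */
+        c.common.cdw10  = cpu_to_le32(vf_index); /* VF index, SQE bytes 41:40     */
+
+        /* These vendor commands go to the PF admin queue, never the VF's. */
+        return nvme_submit_sync_cmd(pf_ctrl->admin_q, &c, NULL, 0);
+    }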
+
+Definition of SAVE_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD2 to indicate a save command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24   |  PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to save |
++---------+------------------------------------------------------------------------------------+
+| 63:42 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it saves the admin queue state and the relevant registers, drains the IO
+SQs and CQs, saves the state of every IO queue, disables the VF controller, and transfers all of
+this data to host memory through DMA.
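+
+A possible C layout for the SAVE_DATA submission queue entry, again with
+illustrative field names only, is sketched below::
+
+    struct nvme_lm_save_cmd {
+        __u8   opcode;      /* byte  00    : OPC = 0xD2                         */
+        __u8   flags;       /* byte  01    : FUSE in bits 1:0, PSDT in bits 7:6 */
+        __le16 command_id;  /* bytes 03:02 : CID                                */
+        __u8   rsvd4[20];   /* bytes 23:04 : reserved                           */
+        __le64 prp1;        /* bytes 31:24 : PRP Entry 1 or PRP List Pointer    */
+        __le64 prp2;        /* bytes 39:32 : PRP Entry 2                        */
+        __le16 vf_index;    /* bytes 41:40 : which VF's internal data to save   */
+        __u8   rsvd42[22];  /* bytes 63:42 : reserved                           */
+    };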
+
+Definition of LOAD_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+| | |
+| Bytes | Description |
+| | |
++---------+------------------------------------------------------------------------------------+
+| | |
+| | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | Bits |Description | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 07:00 |Opcode(OPC):set to 0xD5 to indicate a load command | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| 03:00 | | 13:10 |Reserved | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | | 31:16 |Command Identifier(CID) | |
+| | +-----------+--------------------------------------------------------------------+ |
+| | |
+| | |
++---------+------------------------------------------------------------------------------------+
+| 23:04 | Reserved |
++---------+------------------------------------------------------------------------------------+
+| 31:24   |  PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
++---------+------------------------------------------------------------------------------------+
+| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+| 41:40 | VF index: means which VF controller internal data to load |
++---------+------------------------------------------------------------------------------------+
+| 47:44 | Size: means the size of the device's internal data to be loaded |
++---------+------------------------------------------------------------------------------------+
+| 63:48 | Reserved |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it reads the device's internal data from host memory through DMA, restores the admin
+queue state and the relevant registers, and restores the state of every IO queue.
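+
+Taken together, a migration flow driven from the PF side might look like the
+pseudocode below. All helper names (nvme_lm_suspend() and friends) and the
+pf/pf_dev handles are invented for this sketch and do not exist in the
+series::
+
+    /* Source side: quiesce the VF and read out its internal state. */
+    nvme_lm_suspend(pf, vf_index);
+    size = nvme_lm_query_data_size(pf, vf_index);
+    buf  = dma_alloc_coherent(pf_dev, size, &dma_addr, GFP_KERNEL);
+    nvme_lm_save_data(pf, vf_index, dma_addr);
+
+    /* ... transfer the saved blob to the destination host ... */
+
+    /* Destination side: write the state back and restart the VF. */
+    nvme_lm_load_data(pf, vf_index, dma_addr, size);
+    nvme_lm_resume(pf, vf_index);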
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+| | | | | |
+| Bytes | I/O |Admin | Disc | Description |
+| | | | | |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 01:00 | M | M | R | PCI Vendor ID(VID) |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 03:02   |  M   |  M   |  R   | PCI Subsystem Vendor ID(SSVID)|
++---------+------+------+------+-------------------------------+
+| | | | | |
+| ... | ... | ... | ... | ... |
++---------+------+------+------+-------------------------------+
+| | | | | |
+| 3072 | O | O | O | Live Migration Support |
++---------+------+------+------+-------------------------------+
+| | | | | |
+|4095:3073| O | O | O | Vendor Specific |
++---------+------+------+------+-------------------------------+
+
+According to the NVMe specification, bytes 3072 to 4095 of the identify controller data structure
+are vendor specific. The NVMe device uses byte 3072 to indicate whether live migration is
+supported: 0x0 means live migration is not supported, 0x1 means it is supported, and all other
+values are reserved.
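+
+Software can therefore probe for the capability with something like the
+following sketch (the offset constant simply mirrors the table above)::
+
+    #define NVME_ID_CTRL_LM_SUPPORT_OFFSET 3072
+
+    /* id_ctrl points to the 4096-byte identify controller data structure. */
+    static bool nvme_lm_supported(const __u8 *id_ctrl)
+    {
+        /* 0x0: not supported, 0x1: supported, other values reserved. */
+        return id_ctrl[NVME_ID_CTRL_LM_SUPPORT_OFFSET] == 0x01;
+    }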
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
--
2.34.1
On Tue, Dec 06, 2022 at 01:58:16PM +0800, Lei Rao wrote:
> The documentation describes the details of the NVMe hardware
> extension to support VFIO live migration.

This is not a NVMe hardware extension, this is some really strange and
half-assed intel-specific extension to nvme, which like any other
vendor-specific non-standard extensions to nvme we refused to support in
Linux.

There is a TPAR for live migration building blocks under discussion in the
NVMe technical working group. It will still require mediation of access to
the admin queue to deal with the huge amount of state nvme has that needs
to be migrated (and which doesn't seem to be covered at all here). In Linux
the equivalent would be to implement a mdev driver that allows passing
through the I/O queues to a guest, but it might be a better idea to handle
the device model emulation entirely in Qemu (or other userspace device
models) and just find a way to expose enough of the I/O queues to
userspace. The current TPAR seems to be very complicated for that, as in
many cases we'd only need a way to tie certain namespaces to certain I/O
queues and not waste a lot of resources on the rest of the controller.
On Tue, Dec 06, 2022 at 07:26:04AM +0100, Christoph Hellwig wrote:
> all here). In Linux the equivalent would be to implement a mdev driver
> that allows passing through the I/O queues to a guest, but it might

Definitely not - "mdev" drivers should be avoided as much as possible.

In this case Intel has a real PCI SRIOV VF to expose to the guest, with a
full VF RID. The proper VFIO abstraction is the variant PCI driver as this
series does. We want to use the variant PCI drivers because they properly
encapsulate all the PCI behaviors (MSI, config space, regions, reset, etc)
without requiring re-implementation of this in mdev drivers.

mdev drivers should only be considered if a real PCI VF is not available -
eg because the device is doing "SIOV" or something.

We have several migration drivers in VFIO now following this general
pattern, from what I can see they have done it broadly properly from a
VFIO perspective.

> be a better idea to handle the device model emulation entirely in
> Qemu (or other userspace device models) and just find a way to expose
> enough of the I/O queues to userspace.

This is much closer to the VDPA model which is basically providing some
kernel support to access the IO queue and a lot of SW in qemu to generate
the PCI device in the VM.

The approach has positives and negatives, we have done both in mlx5
devices and we have a preference toward the VFIO model. VDPA specifically
is very big and complicated compared to the VFIO approach.

Overall having fully functional PCI SRIOV VF's available lets more use
cases work than just "qemu to create a VM". qemu can always build a VDPA
like thing by using VFIO and VFIO live migration to shift control of the
device between qemu and HW.

I don't think we know enough about this space at the moment to fix a
specification to one path or the other, so I hope the TPAR will settle on
something that can support both models in SW and people can try things
out.

Jason
On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> In this case Intel has a real PCI SRIOV VF to expose to the guest,
> with a full VF RID.

RID?

> The proper VFIO abstraction is the variant PCI
> driver as this series does. We want to use the variant PCI drivers
> because they properly encapsulate all the PCI behaviors (MSI, config
> space, regions, reset, etc) without requiring re-implementation of this
> in mdev drivers.

I don't think the code in this series has any chance of actually working.
There is a lot of state associated with a NVMe subsystem, controller and
namespace, such as the serial number, subsystem NQN, namespace unique
identifiers, Get/Set features state, pending AENs, log page content. Just
migrating from one device to another without capturing all this has no
chance of actually working.

> I don't think we know enough about this space at the moment to fix a
> specification to one path or the other, so I hope the TPAR will settle
> on something that can support both models in SW and people can try
> things out.

I've not seen anyone from Intel actually contributing to the live
migration TPAR, which is almost two month old by now.
> From: Christoph Hellwig <hch@lst.de>
> Sent: Tuesday, December 6, 2022 9:09 PM
>
> > I don't think we know enough about this space at the moment to fix a
> > specification to one path or the other, so I hope the TPAR will settle
> > on something that can support both models in SW and people can try
> > things out.
>
> I've not seen anyone from Intel actually contributing to the live
> migration TPAR, which is almost two month old by now.

Not a NVMe guy but obviously Intel should join. I have forwarded this to
internal folks to check the actual status.

And I also asked them to prepare a document explaining how this opaque cmd
actually works to address your concerns.
>From: Tian, Kevin <kevin.tian@intel.com>
>Not a NVMe guy but obviously Intel should join. I have forwarded this
>to internal folks to check the actual status.

Intel team has been actively participating in the TPAR definition over
last month. The key leadership in NVMe WG, Peter and Nick (from Intel),
are on top of this TPAR, and help influence the scope of the TPAR to
better support IPU/DPUs. IPU/DPU is a new type of device that desperately
needs a standard based live migration solution. Soon, you may also see
more Intel people to be part of the SW WG for the NVMe live migration
standardization effort.

>And I also asked them to prepare a document explaining how this
>opaque cmd actually works to address your concerns.

There is nothing secret. It's the VF states including VF CSR registers,
IO QP states, and the AdminQ state. The AdminQ state may also include the
pending AER command. As part of the NVMe live migration standardization
effort, it's desirable to standardize the structure as much as we can, and
then have an extension for any device specific state that may be required
to work around specific design HW limitations.
On Fri, Dec 09, 2022 at 04:53:47PM +0000, Li, Yadong wrote:
> The key leadership in NVMe WG, Peter and Nick (from Intel), are on top
> of this TPAR, and help influence the scope of the TPAR to better support
> IPU/DPUs.

You guys should talk more to each other. I think Peter especially has
been vocal to reduce the scope and not include this.
On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > with a full VF RID.
>
> RID?

"Requester ID" - PCI SIG term that in Linux basically means you get to
assign an iommu_domain to the vfio device. Compared to a mdev where many
vfio devices will share the same RID and cannot have iommu_domain's
without using PASID.

> > The proper VFIO abstraction is the variant PCI
> > driver as this series does. We want to use the variant PCI drivers
> > because they properly encapsulate all the PCI behaviors (MSI, config
> > space, regions, reset, etc) without requiring re-implementation of this
> > in mdev drivers.
>
> I don't think the code in this series has any chance of actually
> working. There is a lot of state associated with a NVMe subsystem,
> controller and namespace, such as the serial number, subsystem NQN,
> namespace unique identifiers, Get/Set features state, pending AENs,
> log page content. Just migrating from one device to another without
> capturing all this has no chance of actually working.

From what I understood this series basically allows two Intel devices to
pass a big opaque blob of data. Intel didn't document what is in that
blob, so I assume it captures everything you mention above.

At least, that is the approach we have taken with mlx5. Every single bit
of device state is serialized into the blob and when the device resumes it
is indistinguishable from the original. Otherwise it is a bug.

Jason
On Tue, Dec 06, 2022 at 09:52:54AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 02:09:01PM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 06, 2022 at 09:05:05AM -0400, Jason Gunthorpe wrote:
> > > In this case Intel has a real PCI SRIOV VF to expose to the guest,
> > > with a full VF RID.
> >
> > RID?
>
> "Requester ID" - PCI SIG term that in Linux basically means you get to
> assign an iommu_domain to the vfio device.

Yeah I know the Requester ID, I've just never seen that shortcut for it.

> From what I understood this series basically allows two Intel devices
> to pass a big opaque blob of data. Intel didn't document what is in
> that blob, so I assume it captures everything you mention above.

Which would be just as bad, because it then changes the IDs under the
live OS on a restore. This is not something that can be done behind the
back of the hypervisors / control plane OS.
On Tue, Dec 06, 2022 at 03:00:02PM +0100, Christoph Hellwig wrote:
> > From what I understood this series basically allows two Intel devices
> > to pass a big opaque blob of data. Intel didn't document what is in
> > that blob, so I assume it captures everything you mention above.
>
> Which would be just as bad, because it then changes the IDs under
> the live OS on a restore. This is not something that can be done
> behind the back of the hypervisors / control plane OS.

Sorry, what live OS?

In the VFIO restore model there is no "live OS" on resume. The load/resume
cycle is as destructive as reset to the vfio device.

When qemu operates vfio the destination CPU will not be running until the
load/resume of all the VFIO devices is completed. So from the VM
perspective it sees no change at all, so long as the data blob causes the
destination vfio device to fully match the source, including all IDs, etc.

Jason
On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> In the VFIO restore model there is no "live OS" on resume. The
> load/resume cycle is as destructive as reset to the vfio device.

Of course there may be an OS. As soon as the VF is live Linux will by
default bind to it. And that's the big problem here, the VF should not
actually exist or at least not be usable when such a restore happens - or
to say it in NVMe terms, the Secondary Controller better be in offline
state when state is loaded into it.
On Tue, Dec 06, 2022 at 03:31:26PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:20:26AM -0400, Jason Gunthorpe wrote:
> > In the VFIO restore model there is no "live OS" on resume. The
> > load/resume cycle is as destructive as reset to the vfio device.
>
> Of course there may be an OS. As soon as the VF is live Linux
> will by default bind to it. And that's the big problem here,
> the VF should not actually exist or at least not be usable when
> such a restore happens - or to say it in NVMe terms, the Secondary
> Controller better be in offline state when state is loaded into it.

Sadly in Linux we don't have a SRIOV VF lifecycle model that is any use.
What we do have is a transfer of control from the normal OS driver (eg
nvme) to the VFIO driver.

Also, remember, that VFIO only does live migration between VFIO devices.
We cannot use live migration and end up with a situation where the normal
nvme driver is controlling the VF.

The VFIO load model is explicitly destructive. We replace the current VF
with the loading VF. Both the VFIO variant driver and the VFIO userspace
issuing the load have to be aware of this and understand that the whole
device will change.

From an implementation perspective, I would expect the nvme variant driver
to either place the nvme device in the correct state during load, or
refuse to execute load if it is in the wrong state. To be compatible with
what qemu is doing the "right state" should be entered by completing
function level reset of the VF.

The Linux/qemu parts are still being finalized, so if you see something
that could be changed to better match nvme it would be a great time to
understand that.

Jason
On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
> use.

Beware: The secondary function might as well be a physical function as
well. In fact one of the major customers for "smart" multifunction nvme
devices prefers multi-PF devices over SR-IOV VFs. (and all the symmetric
dual ported devices are multi-PF as well).

So this isn't really about a VF life cycle, but how to manage live
migration, especially on the receive / restore side. And restoring the
entire controller state is extremely invasive and can't be done on a
controller that is in any classic form live. In fact a lot of the state is
subsystem-wide, so without some kind of virtualization of the subsystem it
is impossible to actually restore the state.

To cycle back to the hardware that is posted here, I'm really confused how
it actually has any chance to work and no one has even tried to explain
how it is supposed to work.
On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>> use.
> Beware: The secondary function might as well be a physical function
> as well. In fact one of the major customers for "smart" multifunction
> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
> symmetric dual ported devices are multi-PF as well).
>
> So this isn't really about a VF life cycle, but how to manage live
> migration, especially on the receive / restore side. And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live. In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

ohh, great !

I read this subsystem virtualization proposal of yours after I sent my
proposal for subsystem virtualization in patch 1/5 thread. I guess this
means that this is the right way to go. Lets continue brainstorming this
idea. I think this can be the way to migrate NVMe controllers in a
standard way.

> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I guess in vendor specific implementation you can assume some things that
we are discussing now for making it as a standard.
On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>
> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>> use.
>> Beware: The secondary function might as well be a physical function
>> as well. In fact one of the major customers for "smart" multifunction
>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>> symmetric dual ported devices are multi-PF as well).
>>
>> So this isn't really about a VF life cycle, but how to manage live
>> migration, especially on the receive / restore side. And restoring
>> the entire controller state is extremely invasive and can't be done
>> on a controller that is in any classic form live. In fact a lot
>> of the state is subsystem-wide, so without some kind of virtualization
>> of the subsystem it is impossible to actually restore the state.
>
> ohh, great !
>
> I read this subsystem virtualization proposal of yours after I sent my
> proposal for subsystem virtualization in patch 1/5 thread.
> I guess this means that this is the right way to go.
> Lets continue brainstorming this idea. I think this can be the way to
> migrate NVMe controllers in a standard way.
>
>> To cycle back to the hardware that is posted here, I'm really confused
>> how it actually has any chance to work and no one has even tried
>> to explain how it is supposed to work.
>
> I guess in vendor specific implementation you can assume some things
> that we are discussing now for making it as a standard.

Yes, as I wrote in the cover letter, this is a reference implementation to
start a discussion and help drive standardization efforts, but this series
works well for Intel IPU NVMe. As Jason said, there are two use cases:
shared medium and local medium. I think the live migration of the local
medium is complicated due to the large amount of user data that needs to
be migrated. I don't have a good idea to deal with this situation. But for
Intel IPU NVMe, each VF can connect to remote storage via the NVMF
protocol to achieve storage offloading. This is the shared medium. In this
case, we don't need to migrate the user data, which will significantly
simplify the work of live migration.

The series tries to solve the problem of live migration of shared medium.
But it still lacks dirty page tracking and P2P support, we are also
developing these features.

About the nvme device state, as described in my document, the VF states
include VF CSR registers, every IO Queue Pair state, and the AdminQ state.
During the implementation, I found that the device state data is small per
VF. So, I decided to use the admin queue of the Primary controller to send
the live migration commands to save and restore the VF states like MLX5.

Thanks,
Lei
On 12/11/2022 3:21 PM, Rao, Lei wrote:
>
> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>
>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>> use.
>>> Beware: The secondary function might as well be a physical function
>>> as well. In fact one of the major customers for "smart" multifunction
>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>> symmetric dual ported devices are multi-PF as well).
>>>
>>> So this isn't really about a VF life cycle, but how to manage live
>>> migration, especially on the receive / restore side. And restoring
>>> the entire controller state is extremely invasive and can't be done
>>> on a controller that is in any classic form live. In fact a lot
>>> of the state is subsystem-wide, so without some kind of virtualization
>>> of the subsystem it is impossible to actually restore the state.
>>
>> ohh, great !
>>
>> I read this subsystem virtualization proposal of yours after I sent my
>> proposal for subsystem virtualization in patch 1/5 thread.
>> I guess this means that this is the right way to go.
>> Lets continue brainstorming this idea. I think this can be the way to
>> migrate NVMe controllers in a standard way.
>>
>>> To cycle back to the hardware that is posted here, I'm really confused
>>> how it actually has any chance to work and no one has even tried
>>> to explain how it is supposed to work.
>>
>> I guess in vendor specific implementation you can assume some things
>> that we are discussing now for making it as a standard.
>
> Yes, as I wrote in the cover letter, this is a reference implementation
> to start a discussion and help drive standardization efforts, but this
> series works well for Intel IPU NVMe. As Jason said, there are two use
> cases: shared medium and local medium. I think the live migration of the
> local medium is complicated due to the large amount of user data that
> needs to be migrated. I don't have a good idea to deal with this
> situation. But for Intel IPU NVMe, each VF can connect to remote storage
> via the NVMF protocol to achieve storage offloading. This is the shared
> medium. In this case, we don't need to migrate the user data, which will
> significantly simplify the work of live migration.

I don't think that medium migration should be part of the SPEC. We can
specify it's out of scope.

All the idea of live migration is to have a short downtime and I don't
think we can guarantee short downtime if we need to copy few terabytes
through the networking. If the media copy is taking few seconds, there is
no need to do live migration of few millisecs downtime. Just do regular
migration of a VM.

> The series tries to solve the problem of live migration of shared medium.
> But it still lacks dirty page tracking and P2P support, we are also
> developing these features.
>
> About the nvme device state, as described in my document, the VF states
> include VF CSR registers, every IO Queue Pair state, and the AdminQ
> state. During the implementation, I found that the device state data is
> small per VF. So, I decided to use the admin queue of the Primary
> controller to send the live migration commands to save and restore the
> VF states like MLX5.

I think and hope we all agree that the AdminQ of the controlling NVMe
function will be used to migrate the controlled NVMe function.

Which document are you referring to?

> Thanks,
> Lei
On Sun, Dec 11, 2022 at 04:51:02PM +0200, Max Gurtovoy wrote:
> I don't think that medium migration should be part of the SPEC. We can
> specify it's out of scope.

This is the main item in the TPAR in the technical working group, with
SQ/CQ state being the other one. So instead of arguing here I'd suggest
you all get involved in the working group ASAP.

> All the idea of live migration is to have a short downtime and I don't
> think we can guarantee short downtime if we need to copy few terabytes
> through the networking.

You can. Look at the existing qemu code for live migration for image
based storage, the same concepts also work for hardware offloads.

> If the media copy is taking few seconds, there is no need to do live
> migration of few millisecs downtime. Just do regular migration of a VM.

The point is of course to not do the data migration during the downtime,
but to track newly written LBAs after the start of the copy process.
Again look at qemu for how this has been done for years in software.
On 12/11/2022 10:51 PM, Max Gurtovoy wrote:
> On 12/11/2022 3:21 PM, Rao, Lei wrote:
>> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>>> use.
>>>> Beware: The secondary function might as well be a physical function
>>>> as well. In fact one of the major customers for "smart" multifunction
>>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>>> symmetric dual ported devices are multi-PF as well).
>>>>
>>>> So this isn't really about a VF life cycle, but how to manage live
>>>> migration, especially on the receive / restore side. And restoring
>>>> the entire controller state is extremely invasive and can't be done
>>>> on a controller that is in any classic form live. In fact a lot
>>>> of the state is subsystem-wide, so without some kind of virtualization
>>>> of the subsystem it is impossible to actually restore the state.
>>>
>>> ohh, great !
>>>
>>> I read this subsystem virtualization proposal of yours after I sent my
>>> proposal for subsystem virtualization in patch 1/5 thread.
>>> I guess this means that this is the right way to go.
>>> Lets continue brainstorming this idea. I think this can be the way to
>>> migrate NVMe controllers in a standard way.
>>>
>>>> To cycle back to the hardware that is posted here, I'm really confused
>>>> how it actually has any chance to work and no one has even tried
>>>> to explain how it is supposed to work.
>>>
>>> I guess in vendor specific implementation you can assume some things
>>> that we are discussing now for making it as a standard.
>>
>> Yes, as I wrote in the cover letter, this is a reference implementation
>> to start a discussion and help drive standardization efforts, but this
>> series works well for Intel IPU NVMe. As Jason said, there are two use
>> cases: shared medium and local medium. I think the live migration of the
>> local medium is complicated due to the large amount of user data that
>> needs to be migrated. I don't have a good idea to deal with this
>> situation. But for Intel IPU NVMe, each VF can connect to remote storage
>> via the NVMF protocol to achieve storage offloading. This is the shared
>> medium. In this case, we don't need to migrate the user data, which will
>> significantly simplify the work of live migration.
>
> I don't think that medium migration should be part of the SPEC. We can
> specify it's out of scope.
>
> All the idea of live migration is to have a short downtime and I don't
> think we can guarantee short downtime if we need to copy few terabytes
> through the networking. If the media copy is taking few seconds, there
> is no need to do live migration of few millisecs downtime. Just do
> regular migration of a VM.
>
>> The series tries to solve the problem of live migration of shared medium.
>> But it still lacks dirty page tracking and P2P support, we are also
>> developing these features.
>>
>> About the nvme device state, as described in my document, the VF states
>> include VF CSR registers, every IO Queue Pair state, and the AdminQ
>> state. During the implementation, I found that the device state data is
>> small per VF. So, I decided to use the admin queue of the Primary
>> controller to send the live migration commands to save and restore the
>> VF states like MLX5.
>
> I think and hope we all agree that the AdminQ of the controlling NVMe
> function will be used to migrate the controlled NVMe function.

Fully agree.

> Which document are you referring to?

The fifth patch includes the definition of these commands and how the
firmware handles these live migration commands. It's the documentation
that I referenced.
On Tue, Dec 06, 2022 at 04:01:31PM +0100, Christoph Hellwig wrote:
> So this isn't really about a VF life cycle, but how to manage live
> migration, especially on the receive / restore side. And restoring
> the entire controller state is extremely invasive and can't be done
> on a controller that is in any classic form live. In fact a lot
> of the state is subsystem-wide, so without some kind of virtualization
> of the subsystem it is impossible to actually restore the state.

I cannot speak to nvme, but for mlx5 the VF is largely a contained unit so
we just replace the whole thing.

From the PF there is some observability, eg the VF's MAC address is
visible and a few other things. So the PF has to re-synchronize after the
migration to get those things aligned.

> To cycle back to the hardware that is posted here, I'm really confused
> how it actually has any chance to work and no one has even tried
> to explain how it is supposed to work.

I'm interested as well, my mental model goes as far as mlx5 and hisilicon,
so if nvme prevents the VFs from being contained units, it is a really big
deviation from VFIO's migration design..

Jason
On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> I'm interested as well, my mental model goes as far as mlx5 and
> hisilicon, so if nvme prevents the VFs from being contained units, it
> is a really big deviation from VFIO's migration design..

In NVMe the controller (which maps to a PCIe physical or virtual function)
is unfortunately not very self contained. A lot of state is
subsystem-wide, where the subsystem is, roughly speaking, the container
for all controllers that share storage. That is the right thing to do for,
say, dual ported SSDs that are used for clustering or multi-pathing; for
tenant isolation it is about as wrong as it gets.

There is nothing in the NVMe spec that prohibits you from implementing
multiple subsystems for multiple functions of a PCIe device, but if you do
that there is absolutely no support in the spec to manage shared resources
or any other interaction between them.
> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Tuesday, December 6, 2022 7:36 AM
> To: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Christoph Hellwig <hch@lst.de>; Rao, Lei <Lei.Rao@intel.com>;
> kbusch@kernel.org; axboe@fb.com; kch@nvidia.com; sagi@grimberg.me;
> alex.williamson@redhat.com; cohuck@redhat.com; yishaih@nvidia.com;
> shameerali.kolothum.thodi@huawei.com; Tian, Kevin <kevin.tian@intel.com>;
> mjrosato@linux.ibm.com; linux-kernel@vger.kernel.org; linux-
> nvme@lists.infradead.org; kvm@vger.kernel.org; Dong, Eddie
> <eddie.dong@intel.com>; Li, Yadong <yadong.li@intel.com>; Liu, Yi L
> <yi.l.liu@intel.com>; Wilk, Konrad <konrad.wilk@oracle.com>;
> stephen@eideticom.com; Yuan, Hang <hang.yuan@intel.com>
> Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
>
> On Tue, Dec 06, 2022 at 11:28:12AM -0400, Jason Gunthorpe wrote:
> > I'm interested as well, my mental model goes as far as mlx5 and
> > hisilicon, so if nvme prevents the VFs from being contained units, it
> > is a really big deviation from VFIO's migration design..
>
> In NVMe the controller (which maps to a PCIe physical or virtual
> function) is unfortunately not very self contained. A lot of state is
> subsystem-wide, where the subsystem is, roughly speaking, the container
> for all controllers that share storage. That is the right thing to do
> for say dual ported SSDs that are used for clustering or multi-pathing,
> for tenant isolation it is about as wrong as it gets.

NVMe spec is general, but the implementation details (such as internal
state) may be vendor specific. If the migration happens between 2
identical NVMe devices (from same vendor/device w/ same firmware version),
migration of subsystem-wide state can be naturally covered, right?

> There is nothing in the NVMe spec that prohibits you from implementing
> multiple subsystems for multiple functions of a PCIe device, but if you
> do that there is absolutely no support in the spec to manage shared
> resources or any other interaction between them.

In IPU/DPU area, it seems multiple VFs with SR-IOV is widely adopted. In
VFs, the usage of shared resource can be viewed as implementation
specific, and load/save state of a VF can rely on the hardware/firmware
itself.

Migration of NVMe devices crossing vendor/device is another story: it may
be useful, but brings additional challenges.
On Tue, Dec 06, 2022 at 06:00:27PM +0000, Dong, Eddie wrote:
> NVMe spec is general, but the implementation details (such as internal
> state) may be vendor specific. If the migration happens between 2
> identical NVMe devices (from same vendor/device w/ same firmware
> version), migration of subsystem-wide state can be naturally covered,
> right?

No. If you want live migration for nvme supported in Linux, it must be
speced in the NVMe technical working group and interoperate between
different implementations.